Probability and Statistics with R, Second Edition (Solutions, Instructor Solution Manual) [2 ed.] 9781466504431, 9781466504394, 1466504390


English Pages 682 Year 2015



Solutions Manual for Probability and Statistics with R, Second Edition by

María Dolores Ugarte, Ana F. Militino, and Alan T. Arnholt


Boca Raton London New York

CRC Press is an imprint of the Taylor & Francis Group, an informa business


CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2016 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed on acid-free paper Version Date: 20150619 International Standard Book Number-13: 978-1-4665-0443-1 (Ancillary) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. 
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com


Contents

1 What Is R?
2 Exploring Data
3 General Probability and Random Variables
4 Univariate Probability Distributions
5 Multivariate Probability Distributions
6 Sampling and Sampling Distributions
7 Point Estimation
8 Confidence Intervals
9 Hypothesis Testing
10 Nonparametric Methods
11 Experimental Design
12 Regression
Bibliography
Index

Chapter 1 What Is R?

1. Calculate the following numerical results to three decimal places with R:
(a) $(7 - 8) + 5^3 - 5 \div 6 + \sqrt{62}$
(b) $\ln 3 + \sqrt{2}\,\sin(\pi) - e^3$
(c) $2 \times (5 + 3) - \sqrt{6} + 9^2$
(d) $\ln(5) - \exp(2) + 2^3$
(e) $(9 \div 2) \times 4 - \sqrt{10} + \ln(6) - \exp(1)$

Solution:
(a) 131.041
> round((7 - 8) + 5^3 - 5/6 + sqrt(62), 3)
[1] 131.041
(b) -18.987
> round(log(3) + sqrt(2) * sin(pi) - exp(3), 3)
[1] -18.987
(c) 94.551
> round(2 * (5 + 3) - sqrt(6) + 9^2, 3)
[1] 94.551
(d) 2.22
> round(log(5) - exp(2) + 2^3, 3)
[1] 2.22
(e) 13.911
> round(9/2 * 4 - sqrt(10) + log(6) - exp(1), 3)
[1] 13.911

2. Create a vector named countby5 that is a sequence from 5 to 100 in steps of 5.

Solution:
> countby5 <- seq(from = 5, to = 100, by = 5)
> countby5
 [1]   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85
[18]  90  95 100

3. Create a vector named Treatment with the entries “Treatment One” appearing 20 times, “Treatment Two” appearing 18 times, and “Treatment Three” appearing 22 times.

Solution:
> Treatment <- rep(c("Treatment One", "Treatment Two", "Treatment Three"),
+                  times = c(20, 18, 22))
> xtabs(~Treatment)
Treatment
  Treatment One Treatment Three   Treatment Two
             20              22              18

4. Provide the missing values in rep(seq( , , ), ) to create the sequence 20, 15, 15, 10, 10, 10, 5, 5, 5, 5.

Solution:
> rep(seq(from = 20, to = 5, by = -5), times = 1:4)
 [1] 20 15 15 10 10 10  5  5  5  5
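The solution to Exercise 4 relies on the `times` argument of rep() accepting a vector with one repeat count per element. The following sketch contrasts that form with a scalar `times` and with `each`; the object names (`s`, `by_vector`, `by_scalar`, `by_each`) are chosen here for illustration and do not come from the manual.

```r
# Illustrative sketch; object names are hypothetical.
s <- seq(from = 20, to = 5, by = -5)   # 20 15 10 5

# A vector for `times` gives each element its own repeat count.
by_vector <- rep(s, times = 1:4)

# A single number for `times` repeats the whole vector.
by_scalar <- rep(s, times = 2)

# `each` repeats every element the same number of times, in place.
by_each <- rep(s, each = 2)

print(by_vector)
print(by_scalar)
print(by_each)
```

Only the vector form of `times` produces the staircase pattern the exercise asks for.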

5. Vectors, sequences, and logical operators
(a) Assign the names x and y to the values 5 and 7, respectively. Find $x^y$ and assign the result to z. What is the value stored in z?
(b) Create the vectors u = (1, 2, 5, 4) and v = (2, 2, 1, 1) using the c() function.
(c) Provide R code to find which component of u is equal to 5.
(d) Provide R code to give the components of v greater than or equal to 2.
(e) Find the product u × v. How does R perform the operation?
(f) Explain what R does when two vectors of unequal length are multiplied together. Specifically, what is u × c(u, v)?
(g) Provide R code to define a sequence from 1 to 10 called G and subsequently to select the first three components of G.
(h) Use R to define a sequence from 1 to 30 named J with an increment of 2 and subsequently to choose the first, third, and eighth values of J.
(i) Calculate the scalar product (dot product) of q = (3, 0, 1, 6) by r = (1, 0, 2, 4).
(j) Define the matrix X whose rows are the u and v vectors from part (b).
(k) Define the matrix Y whose columns are the u and v vectors from part (b).
(l) Find the matrix product of X by Y and name it W.
(m) Provide R code that computes the inverse matrix of W and the transpose of that inverse.

Solution:
(a) The value stored in z is 78125.
> x <- 5
> y <- 7
> z <- x^y
> z
[1] 78125
(b)
> u <- c(1, 2, 5, 4)
> v <- c(2, 2, 1, 1)
(c)
> which(u == 5)
[1] 3
(d)
> which(v >= 2)
[1] 1 2
(e) Multiplication of vectors with R is element by element.
> uv <- u * v
> uv
[1] 2 4 5 4
(f) The values in the shorter vector are recycled until the two vectors are the same size. In this case, u*c(u, v) is the same as c(u, u)*c(u, v).
> u * (c(u, v))
[1]  1  4 25 16  2  4  5  4
> c(u, u) * c(u, v)
[1]  1  4 25 16  2  4  5  4
(g)
> G <- 1:10
> G[1:3]
[1] 1 2 3
(h)
> J <- seq(from = 1, to = 30, by = 2)
> J[c(1, 3, 8)]
[1]  1  5 15
(i)
> q <- c(3, 0, 1, 6)
> r <- c(1, 0, 2, 4)
> q %*% r
     [,1]
[1,]   29
(j)
> X <- rbind(u, v)
> X
  [,1] [,2] [,3] [,4]
u    1    2    5    4
v    2    2    1    1
(k)
> Y <- cbind(u, v)
> Y
     u v
[1,] 1 2
[2,] 2 2
[3,] 5 1
[4,] 4 1
(l)
> W <- X %*% Y
> W
   u  v
u 46 15
v 15 10
(m)
> solve(W)
            u           v
u  0.04255319 -0.06382979
v -0.06382979  0.19574468
> t(solve(W))
            u           v
u  0.04255319 -0.06382979
v -0.06382979  0.19574468
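The recycling rule described in part (f) can be checked directly. The sketch below reuses u and v from part (b); the names `long`, `recycled`, `explicit`, and `msg` are introduced here only for illustration. It also shows the case part (f) does not cover: when the longer length is not a multiple of the shorter one, R still recycles but issues a warning.

```r
# Illustrative sketch of vector recycling; helper names are hypothetical.
u <- c(1, 2, 5, 4)
v <- c(2, 2, 1, 1)
long <- c(u, v)                       # length 8

# R silently extends u to length 8 by repeating it.
recycled <- u * long

# The same product with the recycling written out by hand.
explicit <- rep(u, length.out = length(long)) * long
print(identical(recycled, explicit))

# Lengths 4 and 3: recycling still happens, but with a warning.
msg <- tryCatch(u * c(1, 2, 3),
                warning = function(cond) conditionMessage(cond))
print(msg)
```

Because the silent case gives no signal, it can help to compare length(u) and length(long) before relying on recycling.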

6. How many of the apartments in the VIT2005 data frame, part of the PASWR2 package, have a totalprice greater than €400,000 and also have a garage? Use a single line of R code to determine the answer.

Solution: There are 6 apartments with a totalprice greater than €400,000 that also have a garage.
> dim(VIT2005[VIT2005$totalprice >= 400000 & VIT2005$garage >= 1, ])[1]
[1] 6
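Counting rows that satisfy two conditions can also be done by summing the logical vector itself, since TRUE counts as 1. The sketch below uses a small invented data frame (`flats`) in place of VIT2005, so the numbers are not those of the exercise. It also shows that `>` and `>=` differ only when a price sits exactly on the threshold, which is worth checking given that the exercise says "greater than" while the solution uses `>=`.

```r
# Toy data standing in for VIT2005; all values are invented for illustration.
flats <- data.frame(
  totalprice = c(250000, 400000, 410000, 520000, 390000, 450000),
  garage     = c(0, 1, 1, 2, 1, 0)
)

keep <- flats$totalprice > 400000 & flats$garage >= 1

# Two equivalent single-line counts.
n_subset <- dim(flats[keep, ])[1]
n_sum    <- sum(keep)

# The inclusive comparison also counts the flat priced at exactly 400000.
n_inclusive <- sum(flats$totalprice >= 400000 & flats$garage >= 1)

print(c(n_subset, n_sum, n_inclusive))
```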

7. Wheat harvested surface in Spain in 2004: Figure 1.1, made with R, depicts the autonomous communities in Spain. The Wheat Table that follows gives the wheat harvested surfaces in 2004 by autonomous communities in Spain measured in hectares. Provide R code to answer all the questions.

FIGURE 1.1: Autonomous communities in Spain

Wheat Table

community             wheat.surface
Galicia                       18817
Asturias                         65
Cantabria                       440
País Vasco                    25143
Navarra                       66326
La Rioja                      34214
Aragón                       311479
Cataluña                      74206
Islas Baleares                 7203
Castilla y León              619858
Madrid                        13118
Castilla-La Mancha           263424
C. Valenciana                  6111
Región de Murcia               9500
Extremadura                  143250
Andalucía                    558292
Islas Canarias                  100

(a) Create the variables community and wheat.surface from the Wheat Table in this problem. Store both variables in a data.frame named wheatspain.
(b) Find the maximum, the minimum, and the range for the variable wheat.surface.
(c) Which community has the largest harvested wheat surface?
(d) Sort the autonomous communities by harvested surface in ascending order.
(e) Sort the autonomous communities by harvested surfaces in descending order.
(f) Create a new file called wheat.c where Asturias has been removed.
(g) Add Asturias back to the file wheat.c.
(h) Create in wheat.c a new variable called acre indicating the harvested surface in acres (1 acre = 0.40468564224 hectares).
(i) What is the total harvested surface in hectares and in acres in Spain in 2004?
(j) Define in wheat.c the row.names() using the names of the communities. Remove the community variable from wheat.c.
(k) What percent of the autonomous communities have a harvested wheat surface greater than the mean wheat surface area?
(l) Sort wheat.c by autonomous communities’ names (row.names()).
(m) Determine the communities with less than 40,000 acres of harvested surface and find their total harvested surface in hectares and acres.
(n) Create a new file called wheat.sum where the autonomous communities that have less than 40,000 acres of harvested surface are consolidated into a single category named “less than 40,000” with the results from (m).
(o) Use the function dump() on wheat.c, storing the results in a new file named wheat.txt. Remove wheat.c from your path and check that you can recover it from wheat.txt.
(p) Create a text file called wheat.dat from the wheat.sum file using the command write.table(). Explain the differences between wheat.txt and wheat.dat.
(q) Use the command read.table() to read the file wheat.dat.

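Solution:
(a)
> community <- c("Galicia", "Asturias", "Cantabria", "Pais Vasco", "Navarra",
+                "La Rioja", "Aragon", "Cataluna", "Islas Baleares",
+                "Castilla y Leon", "Madrid", "Castilla-La Mancha", "C. Valenciana",
+                "Region de Murcia", "Extremadura", "Andalucia", "Islas Canarias")
> wheat.surface <- c(18817, 65, 440, 25143, 66326, 34214, 311479, 74206,
+                    7203, 619858, 13118, 263424, 6111, 9500, 143250,
+                    558292, 100)
> wheat.spain <- data.frame(community, wheat.surface)
(b)
> max(wheat.spain$wheat.surface)
[1] 619858
> min(wheat.spain$wheat.surface)
[1] 65
> diff(range(wheat.spain$wheat.surface))
[1] 619793
(c)
> wheat.spain[wheat.spain$wheat.surface == max(wheat.spain$wheat.surface),]
         community wheat.surface
10 Castilla y Leon        619858
(d)
> IO <- wheat.spain[order(wheat.spain$wheat.surface), ]
> head(IO)
          community wheat.surface
2          Asturias            65
17   Islas Canarias           100
3         Cantabria           440
13    C. Valenciana          6111
9    Islas Baleares          7203
14 Region de Murcia          9500
(e)
> DO <- wheat.spain[order(wheat.spain$wheat.surface, decreasing = TRUE), ]
> head(DO)
            community wheat.surface
10    Castilla y Leon        619858
16          Andalucia        558292
7              Aragon        311479
12 Castilla-La Mancha        263424
15        Extremadura        143250
8            Cataluna         74206
(f)
> wheat.c <- wheat.spain[wheat.spain$community != "Asturias", ]
> head(wheat.c)
   community wheat.surface
1    Galicia         18817
3  Cantabria           440
4 Pais Vasco         25143
5    Navarra         66326
6   La Rioja         34214
7     Aragon        311479
(g)
> RM <- wheat.spain[wheat.spain$community == "Asturias", ]
> wheat.c <- rbind(wheat.c, RM)
> wheat.c
            community wheat.surface
1             Galicia         18817
3           Cantabria           440
4          Pais Vasco         25143
5             Navarra         66326
6            La Rioja         34214
7              Aragon        311479
8            Cataluna         74206
9      Islas Baleares          7203
10    Castilla y Leon        619858
11             Madrid         13118
12 Castilla-La Mancha        263424
13      C. Valenciana          6111
14   Region de Murcia          9500
15        Extremadura        143250
16          Andalucia        558292
17     Islas Canarias           100
2            Asturias            65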

(h)
> wheat.c$acre <- wheat.c$wheat.surface/0.40468564224
(i)
> sum(wheat.c$wheat.surface)
[1] 2151546
> sum(wheat.c$acre)
[1] 5316586
(j) > > > > nc
(l)
> AO <- wheat.c[order(row.names(wheat.c)), ]
> head(AO)
                wheat.surface         acre
Andalucia              558292 1379569.5763
Aragon                 311479  769681.3711
Asturias                   65     160.6185
C. Valenciana            6111   15100.6099
Cantabria                 440    1087.2637
Castilla y Leon        619858 1531702.4755
(m) The total harvested area is 36537 hectares or 90284.8932 acres.
> lessthan40k <- wheat.c[wheat.c$acre < 40000, ]
> lessthan40k
                 wheat.surface       acre
Cantabria                  440  1087.2637
Islas Baleares            7203 17799.0006
Madrid                   13118 32415.2839
C. Valenciana             6111 15100.6099
Region de Murcia          9500 23475.0112
Islas Canarias             100   247.1054
Asturias                    65   160.6185
> apply(lessthan40k, 2, sum)
wheat.surface          acre
     36537.00      90284.89
(n) > > > > > lt40
(o)
> dump("wheat.c", "wheat.txt")
> rm(wheat.c)
> wheat.c # no longer available
Error in eval(expr, envir, enclos): object ’wheat.c’ not found
> source("wheat.txt")
> head(wheat.c)
           wheat.surface       acre
Galicia            18817  46497.820
Cantabria            440   1087.264
Pais Vasco         25143  62129.706
Navarra            66326 163895.115
La Rioja           34214  84544.635
Aragon            311479 769681.371
(p) There are different values stored in each of wheat.txt and wheat.dat. Specifically, the values from part (m) are collapsed into one category "less than 40,000" in wheat.dat, whereas wheat.txt has all of the values.
> write.table(x = wheat.sum, file = "wheat.dat")
(q)
> tail(read.table(file = "wheat.dat"))
                   wheat.surface       acre
Cataluna                   74206  183367.02
Castilla y Leon           619858 1531702.48
Castilla-La Mancha        263424  650934.88
Extremadura               143250  353978.46
Andalucia                 558292 1379569.58
less than 40,000           36537   90284.89
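Parts (o) through (q) use two different ways of writing a data frame to disk. The sketch below contrasts them on a small data frame; the object `crops`, the temporary file names, and the three rows are chosen here for illustration and are not from the manual. dump() writes R code that recreates the object, so source() restores it under its original name, whereas write.table() writes only the values as delimited text, which read.table() returns as a new object that must be assigned.

```r
# Illustrative sketch; `crops` and the temporary files are hypothetical.
crops <- data.frame(surface = c(65, 440, 100),
                    row.names = c("Asturias", "Cantabria", "Islas Canarias"))
original <- crops

# dump(): the file contains an R assignment that rebuilds `crops`.
dump_file <- tempfile(fileext = ".txt")
dump("crops", file = dump_file)
rm(crops)
source(dump_file)                 # recreates `crops` by name
restored_by_source <- crops

# write.table(): the file contains a header line and one line per row.
table_file <- tempfile(fileext = ".dat")
write.table(original, file = table_file)
restored_by_read <- read.table(table_file)

# Peek at the first line of each file to see the difference in format.
first_dump_line  <- readLines(dump_file)[1]
first_table_line <- readLines(table_file)[1]
print(first_dump_line)
print(first_table_line)
```

The dump file can only be read back by R, while the table file can be opened by other software as well.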

8. Access the data from url http://www.stat.berkeley.edu/users/statlabs/data/babies.data and store the information in an object named BABIES using the function read.table(). A description of the variables can be found at http://www.stat.berkeley.edu/users/statlabs/labs.html. These data are a subset from a much larger study dealing with child health and development.
(a) The variables bwt, gestation, parity, age, height, weight, and smoke use values of 999, 999, 9, 99, 99, 999, and 9, respectively, to denote “unknown.” R uses NA to denote a missing or unavailable value. Recode the missing values in BABIES. Hint: use something similar to BABIES$bwt[BABIES$bwt == 999] = NA.
(b) Use the function na.omit() to create a “clean” data set that removes subjects if any observations on the subject are “unknown.” Store the modified data frame in a data frame named CLEAN.
(c) How many missing values are there for gestation, age, height, weight, and smoke, respectively? How many rows of BABIES have no missing values, one missing value, two missing values, and three missing values, respectively? Note: the number of rows in CLEAN should agree with your answer for the number of rows in BABIES that have no missing values.
(d) Use the function complete.cases() to create a “clean” data set that removes subjects if any observations on the subject are “unknown.” Store the modified data frame in a data frame named CLEAN2. Write a line of code that shows all of the values in CLEAN are the same as those in CLEAN2.
(e) Sort the values in CLEAN by bwt, gestation, and age. Store the sorted values in a data frame named BGA and show the last six rows.
(f) Store the data frame CLEAN in your working directory as a *.csv file.
(g) What percent of the women in CLEAN are pregnant with their first child (parity = 0) and do not smoke?

Solution:
(a)
> site <- "http://www.stat.berkeley.edu/users/statlabs/data/babies.data"
> BABIES <- read.table(file = site, header = TRUE)
> summary(BABIES)
      bwt          gestation         parity            age
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00
 Mean   :119.6   Mean   :286.9   Mean   :0.2549   Mean   :27.37
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00
 Max.   :176.0   Max.   :999.0   Max.   :1.0000   Max.   :99.00
     height          weight        smoke
 Min.   :53.00   Min.   : 87   Min.   :0.0000
 1st Qu.:62.00   1st Qu.:115   1st Qu.:0.0000
 Median :64.00   Median :126   Median :0.0000
 Mean   :64.67   Mean   :154   Mean   :0.4644
 3rd Qu.:66.00   3rd Qu.:140   3rd Qu.:1.0000
 Max.   :99.00   Max.   :999   Max.   :9.0000
> dim(BABIES)
[1] 1236    7
> BABIES$bwt[BABIES$bwt == 999] = NA
> BABIES$gestation[BABIES$gestation == 999] = NA
> BABIES$parity[BABIES$parity == 9] = NA
> BABIES$age[BABIES$age == 99] = NA
> BABIES$height[BABIES$height == 99] = NA
> BABIES$weight[BABIES$weight == 999] = NA
> BABIES$smoke[BABIES$smoke == 999] = NA
> summary(BABIES)
      bwt          gestation         parity            age
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00
                 NA's   :13                       NA's   :2
     height          weight          smoke
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000
 Median :64.00   Median :125.0   Median :0.0000
 Mean   :64.05   Mean   :128.6   Mean   :0.4644
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000
 Max.   :72.00   Max.   :250.0   Max.   :9.0000
 NA's   :22      NA's   :36
> dim(BABIES)
[1] 1236    7

(b)
> CLEAN <- na.omit(BABIES)
> dim(CLEAN)
[1] 1184    7
(c) There are 13 missing values for gestation, 2 missing values for age, 22 missing values for height, and 36 missing values for weight; the summary reports no missing values for smoke. There are 1184 rows of BABIES with no missing values, 33 rows of BABIES with one missing value, 17 rows of BABIES with two missing values, and 2 rows of BABIES with three missing values.
> summary(BABIES)
      bwt          gestation         parity            age
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00
                 NA's   :13                       NA's   :2
     height          weight          smoke
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000
 Median :64.00   Median :125.0   Median :0.0000
 Mean   :64.05   Mean   :128.6   Mean   :0.4644
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000
 Max.   :72.00   Max.   :250.0   Max.   :9.0000
 NA's   :22      NA's   :36
> table(apply(is.na(BABIES), 1, sum))

   0    1    2    3
1184   33   17    2

(d)
> CLEAN <- na.omit(BABIES)
> CLEAN2 <- BABIES[complete.cases(BABIES), ]
> sum(CLEAN != CLEAN2)
[1] 0
(e)
> BGAo <- order(CLEAN$bwt, CLEAN$gestation, CLEAN$age)
> BGA <- CLEAN[BGAo, ]
> tail(BGA)
     bwt gestation parity age height weight smoke
595  170       303      1  21     64    129     0
240  173       293      0  30     63    110     0
557  174       281      0  37     67    155     0
1100 174       284      0  39     65    163     0
748  174       288      0  25     61    182     0
633  176       293      1  19     68    180     0
(f)
> write.csv(CLEAN, file = "CLEAN.csv")
(g) 44.3412% of the women in CLEAN are pregnant with their first child and do not smoke.
> xtabs(~parity + smoke, data = CLEAN)
      smoke
parity   0   1   9
     0 525 341  10
     1 190 118   0
> prop.table(xtabs(~parity + smoke, data = CLEAN))[1,1]*100
[1] 44.34122
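The recoding and cleaning steps of Exercise 8 can be rehearsed on a small invented data frame before touching BABIES. In this sketch, `toy`, `codes`, and the four rows are assumptions made for illustration. Keeping the sentinel codes in one named vector makes it easy to check each column against the code given in part (a) of the exercise, where smoke uses 9 rather than 999.

```r
# Toy data with invented values; 999 and 9 play the role of "unknown" codes.
toy <- data.frame(bwt   = c(120, 113, 999, 128),
                  smoke = c(0, 9, 1, 0))

# One sentinel code per column, looked up by column name.
codes <- c(bwt = 999, smoke = 9)
for (nm in names(codes)) {
  toy[[nm]][toy[[nm]] == codes[[nm]]] <- NA
}

# Missing values per column, and number of rows by count of missing values.
na_per_column <- colSums(is.na(toy))
na_per_row    <- table(rowSums(is.na(toy)))

# Two equivalent ways of keeping only the complete rows.
clean_a <- na.omit(toy)
clean_b <- toy[complete.cases(toy), ]
same_values <- all(clean_a == clean_b)

print(na_per_column)
print(na_per_row)
print(same_values)
```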

9. The data frame WHEATUSA2004 from the PASWR2 package has the USA wheat harvested crop surfaces in 2004 by states. It has two variables, states for the state and acres for thousands of acres.
(a) Use the function row.names() to define the states as the row names for the data frame WHEATUSA2004.
(b) Define a new variable called ha for the surface area given in hectares where 1 acre = 0.40468564224 hectares.
(c) Sort the file according to the harvested surface area in acres.
(d) Which states fall in the top 10% of states for harvested surface area?
(e) Save the contents of WHEATUSA2004 in a new file called WHEATUSA.txt in your favorite directory. Then, remove WHEATUSA2004 from your workspace, and check that the contents of WHEATUSA2004 can be recovered from WHEATUSA.txt.
(f) Use the command write.table() to store the contents of WHEATUSA2004 in a file with the name WHEATUSA.dat. Explain the differences between storing WHEATUSA2004 using dump() and using write.table().
(g) Find the total harvested surface area in acres for the bottom 10% of the states.

Solution:
(a)
> STATES <- WHEATUSA2004$states
> row.names(WHEATUSA2004) <- STATES
> head(WHEATUSA2004)
   states acres
AR     AR   620
CA     CA   320
CO     CO  1700
DE     DE    47
GA     GA   190
ID     ID   700
(b)
> WHEATUSA2004$ha <- WHEATUSA2004$acres * 0.40468564224
> head(WHEATUSA2004)
   states acres        ha
AR     AR   620 250.90510
CA     CA   320 129.49941
CO     CO  1700 687.96559
DE     DE    47  19.02023
GA     GA   190  76.89027
ID     ID   700 283.27995
(c)
> io <- WHEATUSA2004[order(WHEATUSA2004$acres), ]
> head(io)
   states acres       ha
DE     DE    47 19.02023
NY     NY   100 40.46856
MS     MS   135 54.63256
PA     PA   135 54.63256
MD     MD   145 58.67942
SC     SC   180 72.84342

(d) Kansas, Oklahoma, and Texas are in the top 10% of states for harvested surface area.
> top10 <- quantile(WHEATUSA2004$acres, probs = 0.90)
> ans <- WHEATUSA2004[WHEATUSA2004$acres > top10, ]
> row.names(ans)
[1] "KS" "OK" "TX"
(e)
> dump("WHEATUSA2004", "WHEATUSA.txt")
> rm(WHEATUSA2004)
> source("WHEATUSA.txt")
> head(WHEATUSA2004)
   states acres        ha
AR     AR   620 250.90510
CA     CA   320 129.49941
CO     CO  1700 687.96559
DE     DE    47  19.02023
GA     GA   190  76.89027
ID     ID   700 283.27995
(f) dump() writes a text representation of the object as R code, so source() recreates WHEATUSA2004 under its original name. write.table() writes only the values of the data frame as delimited text, which must be read back with read.table() and assigned to a name.
> write.table(WHEATUSA2004, "WHEATUSA.dat")
(g) The total harvested area for the bottom 10% of states is 147 thousand acres.
> bottom10 <- quantile(WHEATUSA2004$acres, probs = 0.10)
> ans <- WHEATUSA2004[WHEATUSA2004$acres < bottom10, ]
> ans
   states acres       ha
DE     DE    47 19.02023
NY     NY   100 40.46856
> THA <- sum(ans$acres)
> THA # Total Harvested Acres
[1] 147
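Parts (d) and (g) both select rows by comparing a variable with one of its own quantiles. A minimal sketch of that pattern follows, using ten invented values (in thousands of acres) with the made-up labels S1 to S10 instead of the WHEATUSA2004 data; `cut_hi`, `cut_lo`, `top`, and `bottom` are names introduced here.

```r
# Invented figures (thousands of acres) for ten hypothetical states.
acres <- c(47, 100, 135, 320, 620, 700, 1700, 3500, 4700, 8500)
names(acres) <- paste0("S", 1:10)

# Cut points at the 90th and 10th percentiles (R's default quantile type).
cut_hi <- quantile(acres, probs = 0.90)
cut_lo <- quantile(acres, probs = 0.10)

# Strict comparisons pick out the states beyond each cut point.
top    <- names(acres)[acres > cut_hi]
bottom <- names(acres)[acres < cut_lo]

# Total for the bottom group, still in thousands of acres.
total_bottom <- sum(acres[acres < cut_lo])

print(top)
print(bottom)
print(total_bottom)
```

Because the cut points are interpolated between observations, the number of states selected need not be exactly one tenth of the total.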

10. Use the data frame VIT2005 in the PASWR2 package, which contains data on the 218 used apartments sold in Vitoria (Spain) in 2005, to answer the following questions. A description of the variables can be obtained from the help file for this data frame.
(a) Create a table of the number of apartments according to the number of garages.
(b) Find the mean of totalprice according to the number of garages.
(c) Create a frequency table of apartments using the categories: number of garages and number of elevators.
(d) Find the mean flat price (total price) for each of the cells of the table created in part (c).
(e) What command will select only the apartments having at least one garage?
(f) Define a new file called data.c with the apartments that have category = "3B" and have an elevator.
(g) Find the mean of totalprice and the mean of area using the information in data.c.

Solution:
(a)
> xtabs(~garage, data = VIT2005)
garage
  0   1   2
167  49   2
(b) The average price for an apartment with no garage, one garage, and two garages is €260537.4385, €345987.7551, and €369250, respectively.
> tapply(VIT2005$totalprice, list(VIT2005$garage), mean)
       0        1        2
260537.4 345987.8 369250.0
(c)
> xtabs(~garage + elevator, data = VIT2005)
      elevator
garage   0   1
     0  44 123
     1   0  49
     2   0   2
(d)
> tapply(VIT2005$totalprice, list(VIT2005$garage, VIT2005$elevator),
+        mean)
         0        1
0 210492.1 278439.8
1       NA 345987.8
2       NA 369250.0
(e)
> atleastonegarage <- VIT2005[VIT2005$garage >= 1, ]
> dim(atleastonegarage)
[1] 51 15
(f)
> data.c <- VIT2005[VIT2005$category == "3B" & VIT2005$elevator == 1, ]
> dim(data.c)
[1] 62 15
(g) The mean of totalprice and the mean of area using the data frame data.c are €287564.5161 and 88.9524 m², respectively.
> MeanTotalPrice <- mean(data.c$totalprice)
> MeanArea <- mean(data.c$area)
> c(MeanTotalPrice, MeanArea)
[1] 287564.51613     88.95242
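The NA entries in the table for part (d) correspond to the zero counts in part (c): tapply() returns NA for any combination of factor levels that has no observations. The sketch below reproduces that behavior on six invented rows; the data frame `flats` and its values are assumptions made for illustration, with column names mirroring those of VIT2005.

```r
# Invented toy data; column names mirror those used in the exercise.
flats <- data.frame(
  totalprice = c(200, 220, 280, 300, 350, 370),
  garage     = c(0, 0, 0, 1, 1, 2),
  elevator   = c(0, 0, 1, 1, 1, 1)
)

# Cell counts: some garage/elevator combinations do not occur.
counts <- xtabs(~garage + elevator, data = flats)

# Cell means: empty cells come back as NA rather than 0.
means <- tapply(flats$totalprice, list(flats$garage, flats$elevator), mean)

print(counts)
print(means)
```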

11. Use the data frame EPIDURALF to answer the following questions:
(a) How many patients have been treated with the Hamstring Stretch?
(b) What percent of the patients treated with Hamstring Stretch were classified as each of Easy, Difficult, and Impossible?
(c) What percent of the patients classified as Easy to palpate were assigned to the Traditional Sitting position?
(d) What is the mean weight for each cell in a contingency table created with the variables Ease and Treatment?
(e) What percent of the patients have a body mass index ($\text{BMI} = \text{kg}/(\text{cm}/100)^2$) less than 25 and are classified as Easy to palpate?

Solution:
(a) A total of 171 patients have been treated with the hamstring stretch position.
> xtabs(~treatment, data = EPIDURALF)
treatment
  Hamstring Stretch Traditional Sitting
                171                 171
> xtabs(~treatment, data = EPIDURALF)[1]
Hamstring Stretch
              171
(b) The percent of patients treated with hamstring stretch that were classified as Easy, Difficult, and Impossible was 58.4795%, 36.8421%, and 4.6784%, respectively.
> T1 <- xtabs(~treatment + ease, data = EPIDURALF)
> T1
                     ease
treatment             Difficult Easy Impossible
  Hamstring Stretch          63  100          8
  Traditional Sitting        51  107         13
> prop.table(T1[1, ]) * 100
 Difficult       Easy Impossible
 36.842105  58.479532   4.678363
(c) 51.6908% of the patients classified as easy to palpate were assigned to the traditional sitting position.
> T1
                     ease
treatment             Difficult Easy Impossible
  Hamstring Stretch          63  100          8
  Traditional Sitting        51  107         13
> prop.table(T1[, "Easy"])[2] * 100
Traditional Sitting
           51.69082
(d)
> tapply(EPIDURALF$kg, list(EPIDURALF$ease, EPIDURALF$treatment),
+        mean)
           Hamstring Stretch Traditional Sitting
Difficult           92.66667            94.27451
Easy                78.67000            79.40187
Impossible         127.87500           113.61538
(e) 9.0643% of patients have a body mass index less than 25 and are classified as easy to palpate.
> EPIDURALF$BMI <- EPIDURALF$kg/(EPIDURALF$cm/100)^2
> EPIDURALF[1:5, 3:8]
   cm       ease           treatment oc complications      BMI
1 172  Difficult Traditional Sitting  0          None 39.21038
2 176       Easy   Hamstring Stretch  0          None 27.76343
3 157  Difficult Traditional Sitting  0          None 29.21011
4 169       Easy   Hamstring Stretch  2          None 22.05805
5 163 Impossible Traditional Sitting  0          None 42.90715
> mean(EPIDURALF$ease == "Easy" & EPIDURALF$BMI < 25) * 100
[1] 9.064327

Nationality MetricToEnglish

picker + + + + + + > >

pickerM 6) stop("m must be less than 6") n >

IRF + + + + + + + + + > >

A · ni i n·t n

1+

i n·t n i n

−1

i n·t i A· =R 1+ −1 n n =R

−1

ARF
> library(MASS)
> help(package = "MASS")
(b) The description file says lqs() fits a regression to the points in the data set, thereby achieving a regression estimator with a high breakdown point.
(c) The function search() provides a list of attached packages and the function library() shows all installed packages.

2. Load Cars93 from the MASS package.
(a) Create density histograms for the variables Min.Price, Max.Price, Weight, and Length using a different color for each histogram.
(b) Superimpose estimated density curves over the histograms.
(c) Use the bwplot() function from lattice to create a box and whiskers plot of Price for every type of vehicle according to the drive train. Do you observe any differences between prices?
(d) Create a graph similar to the one created in (c) using functions from ggplot2.

Solution: (a) and (b) use the following code:
> p1 p2 p3 p4
> multiplot(p1, p2, p3, p4, layout = matrix(c(1, 2, 3, 4),
+           byrow = TRUE, nrow = 2))

[Figure: density histograms of Min.Price, Max.Price, Weight, and Length (y-axis: density), arranged in a 2 × 2 layout]

(c) Vehicles with rear wheel drive trains tend to be more expensive than the same type of vehicles with front wheel drive trains.

Chapter 2: Exploring Data

> bwplot(Price ~ DriveTrain | Type, data = Cars93, as.table = TRUE)

[Figure: box and whiskers plots of Price by drive train (4WD, Front, Rear), one panel per Type: Compact, Large, Midsize, Small, Sporty, Van]

(d)
> ggplot(data = Cars93, aes(x = DriveTrain, y = Price)) +
+   geom_boxplot() +
+   facet_wrap(~ Type) +
+   theme_bw() # black and white

[Figure: boxplots of Price by DriveTrain (4WD, Front, Rear), one panel per Type: Compact, Large, Midsize, Small, Sporty, Van]

3. Load the data frame WHEATSPAIN from the PASWR2 package.

(a) Find the quantiles, deciles, mean, maximum, minimum, interquartile range, variance, and standard deviation of the variable hectares. Comment on the results. What was Spain’s 2004 total harvested wheat area in hectares?
(b) Create a function that calculates the quantiles, the mean, the variance, the standard deviation, the total, and the range of any variable.
(c) Which communities are below the 10th percentile in hectares? Which communities are above the 90th percentile? In which percentile is Navarra?
(d) Create and display in the same graphics device a frequency histogram of the variable acres and a density histogram of the variable acres. Superimpose a density curve over the second histogram.
(e) Explain why using breaks of 0; 100,000; 250,000; 360,000; and 1,550,000 automatically results in a density histogram when using hist() from base graphics.
(f) Create and display in the same graphics device a barplot of acres and a density histogram of acres using break points of 0; 100,000; 250,000; 360,000; and 1,550,000.
(g) Add vertical lines to the density histogram of acres to indicate the locations of the mean and the median, respectively.
(h) Create a boxplot of hectares and label the communities that appear as outliers in the boxplot. (Hint: Use identify().)
(i) Determine the community with the largest harvested wheat surface area using either acres or hectares. Remove this community from the data frame and compute the mean, median, and standard deviation of hectares. How do these values compare to the values for these statistics computed in (a)?

Solution:
(a) The distribution of harvested wheat is unimodal and skewed to the right. The skew shows in how much larger the mean (126561.5294) is than the median (25143). The difference between Q1 and Q2 is also much smaller than the difference between Q3 and Q2. The total harvested area is 2151546 hectares.
> quantile(WHEATSPAIN$hectares)
    0%    25%    50%    75%   100%
    65   7203  25143 143250 619858
> quantile(WHEATSPAIN$hectares, probs = seq(from = 0.1, to = 1.0, by = 0.1))
     10%      20%      30%      40%      50%      60%      70%      80%
   304.0   6329.4   9040.6  15397.6  25143.0  53481.2  88014.8 239389.2
     90%     100%
410204.2 619858.0
> mean(WHEATSPAIN$hectares)
[1] 126561.5
> IQR(WHEATSPAIN$hectares)


Chapter 2:

Exploring Data


[1] 136047
> var(WHEATSPAIN$hectares)
[1] 38934822657
> sd(WHEATSPAIN$hectares)
[1] 197319.1
> sum(WHEATSPAIN$hectares)
[1] 2151546

(b)
> describe <- ...

(c)
> WHEATSPAIN[order(WHEATSPAIN$hectares), ]
            community hectares     acres
2            Asturias       65     160.6
17           Canarias      100     247.1
3           Cantabria      440    1087.3
13       C.Valenciana     6111   15100.6
9            Baleares     7203   17799.0
14             Murcia     9500   23475.0
11             Madrid    13118   32415.3
1             Galicia    18817   46497.8
4             P.Vasco    25143   62129.7
6            La Rioja    34214   84544.6
5             Navarra    66326  163895.1
8            Cataluna    74206  183367.0
15        Extremadura   143250  353978.5
12 Castilla-La Mancha   263424  650934.9
7              Aragon   311479  769681.4
16          Andalucia   558292 1379569.6
10      Castilla-Leon   619858 1531702.5
> which(WHEATSPAIN[order(WHEATSPAIN$hectares), ]$community == "Navarra")
[1] 11
> pk <- (11 - 1)/(17 - 1)
> pk
[1] 0.625
> quantile(WHEATSPAIN$hectares, probs = pk)
62.5%
66326
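Part (b) asks for a reusable summary function. The sketch below is one possible base-R version and is not necessarily the definition used in this manual; the name `describe_sketch` is hypothetical, and the `hectares` vector is typed in from the sorted WHEATSPAIN listing so the example runs without the PASWR2 package.

```r
# Hypothetical helper (an assumption, not the manual's own code):
# returns the statistics requested in part (b) as a named list.
describe_sketch <- function(x) {
  list(quantiles = quantile(x),
       mean      = mean(x),
       variance  = var(x),
       sd        = sd(x),
       total     = sum(x),
       range     = range(x))
}

# Hectares values copied from the sorted WHEATSPAIN listing.
hectares <- c(65, 100, 440, 6111, 7203, 9500, 13118, 18817, 25143,
              34214, 66326, 74206, 143250, 263424, 311479, 558292, 619858)

res <- describe_sketch(hectares)
res$total   # total harvested area in hectares
res$range   # smallest and largest community values
```

Returning a list keeps each statistic addressable by name, which is convenient when the same function is applied to several variables.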

(d)

> p1 <- ...
> p2 <- ...
> multiplot(p1, p2)

[Figure: frequency histogram of acres (count versus acres) above a density histogram of acres (density versus acres).]

(e) If the breaks used in hist() are not equidistant, the default is to produce a density histogram.
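The behavior described in (e) can be checked directly. In the sketch below, the `acres` vector is typed in from the WHEATSPAIN listing in part (c) and the name `bins` is chosen for illustration; `plot = FALSE` is used only so the histogram object can be inspected without opening a graphics device.

```r
# Acres values copied from the WHEATSPAIN listing in part (c).
acres <- c(160.6, 247.1, 1087.3, 15100.6, 17799.0, 23475.0, 32415.3,
           46497.8, 62129.7, 84544.6, 163895.1, 183367.0, 353978.5,
           650934.9, 769681.4, 1379569.6, 1531702.5)
bins <- c(0, 100000, 250000, 360000, 1550000)

# plot = FALSE returns the histogram object without drawing it.
h <- hist(acres, breaks = bins, plot = FALSE)

h$equidist                        # are the bins of equal width?
h$counts                          # number of communities in each bin
sum(h$density * diff(h$breaks))   # total area of the density bars
```

When `h$equidist` is FALSE, plotting the object uses `freq = FALSE` unless told otherwise, so bar areas (not heights) represent relative frequency.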

(f)

> bins <- ...

(i)
> noCL <- ...
> mean(WHEATSPAIN$hectares)
[1] 126561.5
> mean(noCL$hectares)
[1] 95730.5
> median(WHEATSPAIN$hectares)
[1] 25143
> median(noCL$hectares)
[1] 21980
> sd(WHEATSPAIN$hectares)
[1] 197319.1
> sd(noCL$hectares)
[1] 155864.7
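The comparison in part (i) can be sketched on a plain vector. This is an illustration under assumptions, not the manual's own code: the `hectares` values are typed in from the sorted listing in part (c), and `no_max` is a hypothetical name.

```r
# Hectares values copied from the sorted WHEATSPAIN listing.
hectares <- c(65, 100, 440, 6111, 7203, 9500, 13118, 18817, 25143,
              34214, 66326, 74206, 143250, 263424, 311479, 558292, 619858)

# Drop the single largest value (Castilla-Leon, 619858 hectares).
no_max <- hectares[-which.max(hectares)]

# Compare center and spread with and without the largest community.
rbind(all    = c(mean = mean(hectares), median = median(hectares),
                 sd = sd(hectares)),
      no_max = c(mean = mean(no_max), median = median(no_max),
                 sd = sd(no_max)))
```

Because the distribution is strongly right skewed, removing one large value moves the mean far more than the median.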

4. Load the WHEATUSA2004 data frame from the PASWR2 package.
(a) Find the quantiles, deciles, mean, maximum, minimum, interquartile range, variance, and standard deviation for the variable acres. Comment on what the most appropriate measures of center and spread would be for this variable. What is the USA’s 2004 total harvested wheat surface area?
(b) Which states are below the 20th percentile? Which states are above the 80th percentile? In which quantile is WI (Wisconsin)?
(c) Create a frequency and a density histogram of the values in acres in the same graphics device using square plotting regions.
(d) Add vertical lines to the density histogram from (c) to indicate the location of the mean and the median.
(e) Create a boxplot of acres and identify the states that are outliers and their values.

Probability and Statistics with R, Second Edition: Exercises and Solutions


(f) Determine the state with the largest harvested wheat surface in acres. Remove this state from the data frame and compute the mean, median, and standard deviation of acres. How do these values compare to the values for these statistics computed in (a)?

Solution:
(a) The distribution of harvested wheat area is unimodal and skewed to the right. The skew shows in how much larger the mean (1148.733) is than the median (630). The distance from Q1 to Q2 is also much smaller than the distance from Q2 to Q3. The total harvested area is 34462 thousand acres.

> quantile(WHEATUSA2004$acres)
     0%     25%     50%     75%    100%
  47.00  198.75  630.00 1213.75 8500.00

> quantile(WHEATUSA2004$acres, probs = seq(from = 0.1, to = 1.0, by = 0.1))
   10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
 135.0  180.0  263.5  416.0  630.0  824.0  982.5 1634.0 1925.0 8500.0

> mean(WHEATUSA2004$acres)
[1] 1148.733
> IQR(WHEATUSA2004$acres)
[1] 1015
> var(WHEATUSA2004$acres)
[1] 2980303
> sd(WHEATUSA2004$acres)
[1] 1726.355
> sum(WHEATUSA2004$acres)
[1] 34462

(b) DE, NY, MS, PA, and MD are below the 20th percentile (180 thousand acres). NE, CO, WA, TX, OK, and KS are above the 80th percentile (1634 thousand acres). Since WI is the ninth out of thirty states, it corresponds to the 8/29 × 100 = 27.5862th percentile.

> bottom20 <- quantile(WHEATUSA2004$acres, probs = 0.2)
> bottom20
20%
180
> WHEATUSA2004[WHEATUSA2004$acres < bottom20, ]   # bottom states
   states acres       ha
DE     DE    47 19.02023
MD     MD   145 58.67942
MS     MS   135 54.63256
NY     NY   100 40.46856
PA     PA   135 54.63256
> top20 <- quantile(WHEATUSA2004$acres, probs = 0.8)
> top20
 80%
1634
> WHEATUSA2004[WHEATUSA2004$acres > top20, ]   # top states
   states acres        ha
CO     CO  1700  687.9656
KS     KS  8500 3439.8280
NE     NE  1650  667.7313
OK     OK  4700 1902.0225
TX     TX  3500 1416.3997
WA     WA  1750  708.1999

> WHEATUSA2004[order(WHEATUSA2004$acres), ]
      states acres         ha
DE        DE    47   19.02023
NY        NY   100   40.46856
MS        MS   135   54.63256
PA        PA   135   54.63256
MD        MD   145   58.67942
SC        SC   180   72.84342
VA        VA   180   72.84342
GA        GA   190   76.89027
WI        WI   225   91.05427
TN        TN   280  113.31198
CA        CA   320  129.49941
KY        KY   380  153.78054
IN        IN   440  178.06168
NC        NC   460  186.15540
AR        AR   620  250.90510
MI        MI   640  258.99881
ID        ID   700  283.27995
OR        OR   780  315.65480
OH        OH   890  360.17022
IL        IL   900  364.21708
MO        MO   930  376.35765
Other  Other  1105  447.17763
SD        SD  1250  505.85705
MT        MT  1630  659.63760
NE        NE  1650  667.73131
CO        CO  1700  687.96559
WA        WA  1750  708.19987
TX        TX  3500 1416.39975
OK        OK  4700 1902.02252
KS        KS  8500 3439.82796
> which(WHEATUSA2004[order(WHEATUSA2004$acres), ]$states == "WI")
[1] 9
> pk <- (9 - 1)/(30 - 1)
> pk
[1] 0.2758621
> quantile(WHEATUSA2004$acres, probs = pk)
27.58621%
      225

(c)
> p1 <- ...
> p2 <- ...
> multiplot(p1, p2)

[Figure: frequency histogram of acres (count versus acres) above a density histogram of acres (density versus acres).]
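The percentile calculation for WI in part (b) can be reproduced on a plain vector. The sketch assumes the (rank - 1)/(n - 1) convention implied by the 8/29 figure in the solution, which matches the default interpolation rule (type 7) of quantile(); the `acres` values are typed in from the sorted listing, and `pos` is a hypothetical name.

```r
# Acres (in thousands) copied from the sorted WHEATUSA2004 listing.
acres <- c(47, 100, 135, 135, 145, 180, 180, 190, 225, 280,
           320, 380, 440, 460, 620, 640, 700, 780, 890, 900,
           930, 1105, 1250, 1630, 1650, 1700, 1750, 3500, 4700, 8500)

pos <- which(sort(acres) == 225)         # WI harvested 225 thousand acres
pk  <- (pos - 1) / (length(acres) - 1)   # position rescaled to [0, 1]

pk
quantile(acres, probs = pk)              # recovers WI's own value
```

Feeding `pk` back into quantile() returns the WI value because, under type 7, the k-th order statistic sits exactly at probability (k - 1)/(n - 1).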

(d)
> p2 <- ...
> p2 + geom_vline(xintercept = c(median(WHEATUSA2004$acres),
+                                mean(WHEATUSA2004$acres))) +
+   annotate("text", label = "Median",
+            x = median(WHEATUSA2004$acres), y = 0.0012) +
+   annotate("text", label = "Mean",
+            x = mean(WHEATUSA2004$acres), y = 0.0010)

[Figure: density histogram of acres with vertical lines at the median and the mean, labeled "Median" and "Mean".]

(e) The three outliers correspond to KS, OK, and TX.

> boxplot(WHEATUSA2004$acres)
> OUTA <- ...
> OUTA   # outlier values
[1] 8500 4700 3500
> WHEATUSA2004[WHEATUSA2004$acres %in% OUTA, ]
   states acres       ha
KS     KS  8500 3439.828
OK     OK  4700 1902.023
TX     TX  3500 1416.400

[Figure: boxplot of acres.]
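The outlier values in part (e) follow from the usual boxplot rule: points more than 1.5 times the hinge spread beyond a hinge. The sketch below uses boxplot.stats(), which applies the same rule as boxplot() without drawing anything; the `acres` vector is typed in from the sorted listing in part (b), and this is an illustration rather than the manual's own code.

```r
# Acres (in thousands) copied from the sorted WHEATUSA2004 listing.
acres <- c(47, 100, 135, 135, 145, 180, 180, 190, 225, 280,
           320, 380, 440, 460, 620, 640, 700, 780, 890, 900,
           930, 1105, 1250, 1630, 1650, 1700, 1750, 3500, 4700, 8500)

hinges <- fivenum(acres)[c(2, 4)]                  # lower and upper hinges
upper_fence <- hinges[2] + 1.5 * diff(hinges)      # largest non-outlying limit

out <- boxplot.stats(acres)$out   # values flagged as outliers
upper_fence
out
```

Any state whose acreage exceeds `upper_fence` is drawn as an individual point in the boxplot.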

(f) Based on the output from part (b), KS, which sits in the ninth row of the WHEATUSA2004 data frame, has the largest harvested wheat surface. Once KS is removed, the mean, median, and standard deviation are all smaller than those computed in part (a).

> noKS <- ...
> mean(WHEATUSA2004$acres)
[1] 1148.733
> mean(noKS$acres)
[1] 895.2414
> median(WHEATUSA2004$acres)
[1] 630
> median(noKS$acres)
[1] 620
> sd(WHEATUSA2004$acres)
[1] 1726.355
> sd(noKS$acres)
[1] 1044.102

5. The data frame VIT2005 in the PASWR2 package contains descriptive information and the appraised total price (in euros) for apartments in Vitoria, Spain.


(a) Create a frequency table, a piechart, and a barplot showing the number of apartments grouped by the variable out. For you, which method conveys the information best?

(b) Characterize the distribution of the variable totalprice.

(c) Characterize the relationship between totalprice and area.

(d) Create a Trellis plot of totalprice versus area conditioning on toilets. Create the same graph with ggplot2 graphics. Are there any outliers? Ignoring any outliers, between what two values of area do apartments have both one and two bathrooms?

(e) Use the area values reported in (d) to create a subset of apartments that have both one and two bathrooms. By how much does an additional bathroom increase the appraised value of an apartment? Would you be willing to pay for an additional bathroom if you lived in Vitoria, Spain?

Solution:
(a) The barplot is the easiest to read.

> VIT2005$out <- ...
> levels(VIT2005$out)
[1] "E25"  "E50"  "E75"  "E100"
> xtabs(~out, data = VIT2005)
out
 E25  E50  E75 E100
   3   87    6  122
> p1 <- ...

(b)
> max(VIT2005$totalprice)
[1] 560000
> median(VIT2005$totalprice)
[1] 269750
> IQR(VIT2005$totalprice)
[1] 100125

[Figure: histogram of totalprice (count versus totalprice).]

(c) There is a positive linear relationship between totalprice and area.

> ggplot(data = VIT2005, aes(x = area, y = totalprice)) +
+   geom_point() +
+   theme_bw()

[Figure: scatterplot of totalprice versus area.]

(d) Apartments with one bathroom are generally between 50 and 100 m², while apartments with two bathrooms are generally between 80 and 160 m². Apartments with one bathroom and apartments with two bathrooms overlap between roughly 80 and 100 m².

> xyplot(totalprice ~ area | toilets, data = VIT2005, layout = c(1, 2),
+        as.table = TRUE)
> TEXT <- "Number of Toilets"
> ggplot(data = VIT2005, aes(x = area, y = totalprice,
+                            color = as.factor(toilets))) +
+   geom_point() +
+   facet_grid(toilets ~ .) +
+   theme_bw() +
+   guides(color = guide_legend(TEXT))

[Figure: Trellis plot and ggplot2 plot of totalprice versus area, one panel for each value of toilets; the ggplot2 legend is titled "Number of Toilets".]

(e) For apartments between 80 and 100 m², the median totalprice with a second bathroom is €36000 higher than with one bathroom. Whether a reader would be willing to spend €36000 for an additional bathroom is a matter of opinion, so answers will vary.

> bothbaths <- ...
> ANS <- ...
> ANS
     1      2
255000 291000
> diff(ANS)
    2
36000
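The comparison in (e) amounts to three steps: restrict to the overlapping area range, take the median price within each bathroom count, and difference the medians. The sketch below illustrates those steps on a small hypothetical data frame (`apts`, with made-up prices in arbitrary units); it does not use the VIT2005 values and is not necessarily how the manual computed its answer.

```r
# Hypothetical toy data, NOT the VIT2005 values.
apts <- data.frame(
  area       = c(70, 85, 90, 95, 88, 92, 99, 130),
  toilets    = c(1, 1, 1, 1, 2, 2, 2, 2),
  totalprice = c(90, 100, 110, 120, 130, 140, 150, 200)
)

# Keep only apartments in the area range where both groups occur.
both <- subset(apts, area >= 80 & area <= 100)

# Median price for each number of bathrooms, then their difference.
med <- tapply(both$totalprice, both$toilets, median)
med
diff(med)
```

Restricting to the common area range first keeps the comparison from being driven by apartment size rather than by the extra bathroom.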

6. Consider the data frame PAMTEMP from the PASWR2 package, which contains temperature and precipitation for Pamplona, Spain, from January 1, 1990, to December 31, 2010.
(a) Create side-by-side violin plots of the variable tmean for each month. Make sure the level of month is correct. Hint: Look at the examples for PAMTEMP. Characterize the pattern of side-by-side violin plots.
(b) Create side-by-side violin plots of the variable tmean for each year. Characterize the pattern of side-by-side violin plots.
(c) Find the date for the minimum value of tmean.
(d) Find the date for the maximum value of tmean.
(e) Find the date for the maximum value of precip.
(f) How many days have reported a tmax value greater than 38 °C?
(g) Create a barplot showing the total precipitation by month for the period January 1, 1990, to December 31, 2010. Based on your barplot, which month had the least amount of precipitation? Which month had the greatest amount of precipitation? Hint: Use the plyr package to create an appropriate data frame.
(h) Create a barplot showing the total precipitation by year for the period January 1, 1990, to December 31, 2010. Based on your barplot, which year had the least amount of precipitation? Which year had the greatest amount of precipitation? Hint: Use the plyr package to create an appropriate data frame.
(i) Create a graph showing the maximum temperature versus year and the minimum temperature versus year. Does the graph suggest temperatures are becoming more extreme over time?

Solution:
(a)

> library(PASWR2)
> levels(PAMTEMP$month)
 [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct"
[12] "Sep"
> PAMTEMP$month <- factor(PAMTEMP$month, levels = month.abb[1:12])
> levels(PAMTEMP$month)
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
[12] "Dec"
> ggplot(data = PAMTEMP) +
+   geom_violin(aes(x = month, y = tmean, fill = month)) +
+   theme_bw() +
+   guides(fill = FALSE) +
+   labs(x = "", y = "Temperature (Celsius)")

[Figure: side-by-side violin plots of tmean (Temperature, Celsius) by month, Jan through Dec.]

The center of each violin plot from January to July generally increases. From August to December, the center of each violin plot decreases. There is a cyclical pattern of warming and then cooling throughout the year.

(b)
> ggplot(data = PAMTEMP) +
+   geom_violin(aes(x = as.factor(year), y = tmean,
+                   fill = as.factor(year))) +
+   theme_bw() +
+   guides(fill = FALSE) +
+   labs(x = "", y = "Temperature (Celsius)") +
+   theme(axis.text.x = element_text(angle = 60, hjust = 1))

[Figure: side-by-side violin plots of tmean (Temperature, Celsius) by year, 1990 through 2010.]

There is no apparent pattern in the side-by-side violin plots of tmean. Temperature variation in Pamplona, Spain, appears similar from year to year over the period 1990 to 2010.

(c)
> PAMTEMP[which.min(PAMTEMP$tmean), ]
     tmax tmin precip day month year tmean
4285    2  -10    0.5  25   Dec 2001    -4

The minimum value of tmean is -4 °C, which occurred on Dec 25, 2001.

(d)
> PAMTEMP[which.max(PAMTEMP$tmean), ]
     tmax tmin precip day month year tmean
4873   39   23      0   5   Aug 2003    31

The maximum value of tmean is 31 °C, which occurred on Aug 5, 2003.

(e)
> PAMTEMP[which.max(PAMTEMP$precip), ]
     tmax tmin precip day month year tmean
1455  8.6    4   69.2  25   Dec 1993   6.3

The maximum value of precip is 69.2 mm, which occurred on Dec 25, 1993.

(f)
> sum(PAMTEMP$tmax > 38)
[1] 15

15 days reported a tmax value greater than 38 °C.

(g)
> library(plyr)
> SEL <- ...
> head(SEL)
  year month       TP
1 1990   Jan  31.1008
2 1990   Feb  26.8005
3 1990   Mar   9.3001
4 1990   Apr 121.1001
5 1990   May 120.5002
6 1990   Jun  77.0006
> ggplot(data = SEL, aes(x = month, y = TP, fill = month)) +
+   geom_bar(stat = "identity") +
+   labs(y = "Total Precipitation (1990-2010) in mm", x = "") +
+   theme_bw() +
+   guides(fill = FALSE)

[Figure: barplot of total precipitation (1990-2010) in mm by month, Jan through Dec.]
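The hint for (g) points to the plyr package; the grouping idea itself can also be shown with base R's aggregate(), which is swapped in here purely for illustration. The data frame `toy` below is hypothetical and stands in for PAMTEMP; none of its values come from the real records.

```r
# Hypothetical toy data standing in for PAMTEMP (not the real records).
toy <- data.frame(
  month  = factor(c("Jan", "Jan", "Feb", "Feb", "Mar"), levels = month.abb),
  precip = c(1.5, 2.5, 0, 5, 3)
)

# Total precipitation for each month that appears in the data.
totals <- aggregate(precip ~ month, data = toy, FUN = sum)
totals

totals[which.min(totals$precip), ]   # month with the least precipitation
totals[which.max(totals$precip), ]   # month with the most precipitation
```

Declaring `month` as a factor with calendar-ordered levels keeps the summary rows (and any barplot built from them) in Jan to Dec order rather than alphabetical order.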

August has the minimum total precipitation of all of the months for the period 1990-2010. November has the maximum total precipitation of all of the months for the period 1990-2010.

(h)
> SELY <- ...
> head(SELY)
  year       TP
1 1990 692.5048
2 1991 704.0052
3 1992 902.8038
4 1993 752.1041
5 1994 638.8045
6 1995 582.3028
> ggplot(data = SELY, aes(x = year, y = TP, fill = as.factor(year))) +
+   geom_bar(stat = "identity") +
+   labs(x = "", y = "Total Precipitation (mm)") +
+   theme_bw() +
+   guides(fill = FALSE)
> SELY[which.max(SELY$TP), ]
  year       TP
8 1997 929.2025
> SELY[which.min(SELY$TP), ]
  year       TP
9 1998 566.2011

[Figure: barplot of total precipitation (mm) by year, 1990 through 2010.]

The greatest yearly total precipitation on record (929.2025 mm) occurred in 1997. The least yearly total precipitation on record (566.2011 mm) occurred in 1998.

(i)
> SEL <- ...
> head(SEL)
  year Tmax Tmin
1 1990 37.0 -5.6
2 1991 38.2 -5.2
3 1992 36.4 -4.4
4 1993 37.6 -4.5
5 1994 37.2 -6.8
6 1995 40.0 -5.0
> ggplot(data = SEL, aes(x = year, y = Tmax)) +
+   geom_line(color = "red") +
+   geom_line(aes(x = year, y = Tmin), color = "blue") +
+   theme_bw() +
+   labs(y = "Temperature (Celsius)") +
+   geom_smooth(method = "lm", color = "red") +
+   geom_smooth(aes(x = year, y = Tmin), method = "lm")

[Figure: yearly maximum and minimum temperature (Celsius) versus year, 1990 through 2010, each with a fitted least squares line.]

Based on the graph, there is too much variability from year to year to make any statement about the weather becoming more extreme over time.

7. Access the data from the URL http://www.stat.berkeley.edu/users/statlabs/data/babies.data and store the information in an object named BABIES using the function read.table(). A description of the variables can be found at http://www.stat.berkeley.edu/users/statlabs/labs.html. These data are a subset from a much larger study dealing with child health and development.
(a) Create a “clean” data set that removes subjects if any observations on the subject are “unknown.” Note that bwt, gestation, parity, age, height, weight, and smoke use values of 999, 999, 9, 99, 99, 999, and 9, respectively, to denote “unknown.” Store the modified data set in an object named CLEAN.
(b) Use the information in CLEAN to create a density histogram of the birth weights of babies whose mothers have never smoked (smoke=0) and another histogram placed directly below the first in the same graphics device for the birth weights of babies whose mothers currently smoke (smoke=1). Make the range of the x-axis 30 to 180 (ounces) for both histograms. Superimpose a density curve over each histogram.
(c) Based on the histograms in (b), characterize the distribution of baby birth weight for both non-smoking and smoking mothers.
(d) What is the mean weight difference between babies of smokers and non-smokers? Can you think of any reasons not to use the mean as a measure of center to compare birth weights in this problem?
(e) Create side-by-side boxplots to compare the birth weights of babies whose mothers never smoked and those who currently smoke. Use traditional graphics (boxplot()), lattice graphics (bwplot()), and ggplot graphics to create the boxplots.
(f) What is the median weight difference between babies who are firstborn and those who are not?
(g) Create a single graph of the densities for pre-pregnancy weight for mothers who have never smoked and for mothers who currently smoke. Make sure both densities appear on the same graphics device and use an appropriate legend.
(h) Characterize the pre-pregnancy distribution of weight for mothers who have never smoked and for mothers who currently smoke.
(i) What is the mean pre-pregnancy weight difference between mothers who do not smoke and those who do? Can you think of any reasons not to use the mean as a measure of center to compare pre-pregnancy weights in this problem?
(j) Compute the body mass index (BMI) for each mother in CLEAN. Recall that BMI is defined as kg/m² (0.0254 m = 1 in., and 0.45359 kg = 1 lb.). Add the variables weight in kg, height in m, and BMI to CLEAN and store the result in CLEANP.
(k) Characterize the distribution of BMI.
(l) Group pregnant mothers according to their BMI quartile. Find the mean and standard deviation for baby birth weights in each quartile for mothers who have never smoked and those who currently smoke. Find the median and IQR for baby birth weights in each quartile for mothers who have never smoked and those who currently smoke. Based on your answers, would you characterize birth weight in each group as relatively symmetric or skewed? Create histograms and densities of bwt conditioned on BMI quartiles and whether the mother smokes to verify your previous assertions about the shape.
(m) Create side-by-side boxplots of bwt based on whether the mother smokes conditioned on BMI quartiles. Does this graph verify your findings in (l)?
(n) Does it appear that BMI is related to the birth weight of a baby? Create a scatterplot of birth weight (bwt) versus BMI while conditioning on BMI quartiles and whether the mother smokes to help answer the question.
(o) Replace baby birth weight (bwt) with gestation length (gestation) and answer questions (l), (m), and (n).
(p) Create a scatterplot of bwt versus gestation conditioned on BMI quartiles and whether the mother smokes. Fit straight lines to the data using lm(), lqs(), and rlm(); and display the lines in the scatterplots. What do you find interesting about the resulting graphs?
(q) Create a table of smoke by parity. Display the numerical results in a graph. What percent of mothers did not smoke during the pregnancy of their first child?

Solution:
> site <- "http://www.stat.berkeley.edu/users/statlabs/data/babies.data"
> BABIES <- read.table(site, header = TRUE)
> head(BABIES)
  bwt gestation parity age height weight smoke
1 120       284      0  27     62    100     0
2 113       282      0  33     64    135     0
3 128       279      0  28     64    115     1
4 123       999      0  36     69    190     0
5 108       282      0  23     67    125     1
6 136       286      0  25     62     93     0

(a)
> CLEAN <- ...
> CLEAN$smoke <- ...

(b)
> ggplot(data = CLEAN, aes(x = bwt, y = ..density..)) +
+   geom_histogram(fill = "lightpink") +
+   geom_density(color = "red") +
+   facet_grid(smoke ~ .) +
+   xlim(30, 180) +
+   theme_bw()

[Figure: density histograms of bwt with superimposed density curves; Non-Smoker panel above Smoker panel.]
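For part (a), one way to drop every subject with an "unknown" code is to compare each variable against its own sentinel value. The sketch below is an assumption-laden illustration, not the manual's code: it uses only the six rows printed by head(BABIES), and the names `babies6`, `unknown`, `keep`, and `clean6` are hypothetical.

```r
# First six rows of BABIES as printed by head(BABIES).
babies6 <- data.frame(
  bwt       = c(120, 113, 128, 123, 108, 136),
  gestation = c(284, 282, 279, 999, 282, 286),
  parity    = c(0, 0, 0, 0, 0, 0),
  age       = c(27, 33, 28, 36, 23, 25),
  height    = c(62, 64, 64, 69, 67, 62),
  weight    = c(100, 135, 115, 190, 125, 93),
  smoke     = c(0, 0, 1, 0, 1, 0)
)

# "Unknown" codes listed in the problem statement, one per variable.
unknown <- c(bwt = 999, gestation = 999, parity = 9, age = 99,
             height = 99, weight = 999, smoke = 9)

# A row is kept only if none of its variables equals its unknown code.
keep <- rep(TRUE, nrow(babies6))
for (v in names(unknown)) {
  keep <- keep & babies6[[v]] != unknown[[v]]
}
clean6 <- babies6[keep, ]
nrow(clean6)   # the subject with gestation 999 is dropped
```

Checking each column against its own code matters because 99 is a legitimate value for some variables (for example, weight) while being the "unknown" code for others.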

(c) Based on the density histograms in part (b), the distributions of birth weights for both smoking and non-smoking mothers are unimodal and symmetric. The mean and standard deviation for birth weights of non-smoking mothers are 123.0853 and 17.4237 ounces, respectively. The mean and standard deviation for birth weights of smoking mothers are 113.8192 and 18.295 ounces, respectively.

> mean(CLEAN$bwt[CLEAN$smoke == "Non-Smoker"])
[1] 123.0853
> sd(CLEAN$bwt[CLEAN$smoke == "Non-Smoker"])
[1] 17.4237
> mean(CLEAN$bwt[CLEAN$smoke == "Smoker"])
[1] 113.8192
> sd(CLEAN$bwt[CLEAN$smoke == "Smoker"])
[1] 18.29501

(d) The mean birth weight of babies born to non-smoking mothers exceeds the mean birth weight of babies born to smoking mothers by 9.2661 ounces.

> ANS <- ...
> ANS
Non-Smoker     Smoker
  123.0853   113.8192
> DIFF <- ...
> DIFF
Non-Smoker
  9.266143

(e)
> boxplot(bwt ~ smoke, data = CLEAN)
> bwplot(bwt ~ smoke, data = CLEAN)
> ggplot(data = CLEAN, aes(x = smoke, y = bwt)) +
+   geom_boxplot() +
+   theme_bw()

[Figure: side-by-side boxplots of bwt for Non-Smoker and Smoker, drawn with base, lattice, and ggplot2 graphics.]

(f) The median birth weight difference between firstborn babies and those that are not firstborn is 2 ounces.

> ANS <- ...
> ANS
  0   1
120 118
> DIFF <- ...
> DIFF
0
2

(g)
> ggplot(data = CLEAN, aes(x = weight, color = smoke)) +
+   geom_density() +
+   theme_bw()

[Figure: density curves of pre-pregnancy weight, one for Non-Smoker and one for Smoker, with a legend for smoke.]

(h) The distribution of pre-pregnancy weight for both smokers and non-smokers is unimodal and skewed to the right. The median and IQR of pre-pregnancy weight for smokers are 125 and 24.5 pounds, respectively. The median and IQR of pre-pregnancy weight for non-smokers are 126 and 25 pounds, respectively.

> median(CLEAN$weight[CLEAN$smoke == "Smoker"])
[1] 125
> IQR(CLEAN$weight[CLEAN$smoke == "Smoker"])
[1] 24.5
> median(CLEAN$weight[CLEAN$smoke == "Non-Smoker"])
[1] 126
> IQR(CLEAN$weight[CLEAN$smoke == "Non-Smoker"])
[1] 25


(i) The mean pre-pregnancy weight of non-smokers exceeds that of smokers by 2.5603 pounds. The mean is a poor measure of center in this problem because both distributions are skewed.

> ANS <- ...
> ANS
Non-Smoker     Smoker
  129.4797   126.9194
> DIFF <- ...
> DIFF
Non-Smoker
   2.56033

(j)
> CLEANP <- ...

(k)
> ggplot(data = CLEANP, aes(x = BMI)) +
+   geom_density() +
+   theme_bw()

[Figure: density curve of BMI.]
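The BMI computation in part (j) is a unit conversion followed by a ratio. A minimal sketch, using the conversion factors given in the problem statement; the function name `bmi` and the constants' names are hypothetical, and the example values are the weight and height of the first mother printed by head(BABIES).

```r
# Conversion factors given in the problem statement.
lb_to_kg <- 0.45359   # kilograms per pound
in_to_m  <- 0.0254    # meters per inch

# BMI = weight in kg divided by the square of height in m.
bmi <- function(weight_lb, height_in) {
  (weight_lb * lb_to_kg) / (height_in * in_to_m)^2
}

bmi(100, 62)   # first mother in head(BABIES): 100 lb, 62 in
```

Because the function is vectorized, it can be applied to whole columns, for example `bmi(CLEAN$weight, CLEAN$height)`, assuming CLEAN has been built as in part (a).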

(l) The requested answers are computed in R Code 2.1. Based on the values, birth weight in each quartile appears to be symmetric regardless of the mother’s smoking status.

R Code 2.1
> values <- ...
> tapply(CLEANP$bwt, list(CLEANP$Quartiles, CLEANP$smoke),
+        IQR)
            Non-Smoker Smoker
[15.7,19.9]       20.0  25.25
(19.9,21.3]       21.0  21.75
(21.3,23.3]       19.5  24.00
(23.3,40.4]       23.0  22.00
> ggplot(data = CLEANP, aes(x = bwt)) +
+   geom_histogram(fill = "lightblue") +
+   theme_bw() +
+   facet_grid(smoke ~ Quartiles)
> ggplot(data = CLEANP, aes(x = bwt)) +
+   geom_density() +
+   theme_bw() +
+   facet_grid(smoke ~ Quartiles)

[Figure: histograms of bwt, faceted by smoke (rows) and BMI quartile (columns).]

[Figure: density curves of bwt, faceted by smoke (rows) and BMI quartile (columns).]
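The row labels in the tables for (l), such as [15.7,19.9] and (19.9,21.3], are the kind produced by cutting a variable at its quartiles. The sketch below shows that grouping step followed by a two-way tapply() on a hypothetical data frame `toy`; the values are made up and are not the CLEANP data, and this is not necessarily how the manual built its Quartiles variable.

```r
# Hypothetical toy values, NOT the CLEANP data.
toy <- data.frame(
  BMI   = 17:32,
  bwt   = c(110, 120, 115, 125, 118, 122, 121, 119,
            130, 112, 117, 123, 126, 114, 120, 124),
  smoke = rep(c("Non-Smoker", "Smoker"), times = 8)
)

# Cut BMI at its quartiles; include.lowest keeps the minimum in group 1.
toy$Quartiles <- cut(toy$BMI, breaks = quantile(toy$BMI),
                     include.lowest = TRUE)

# One summary value for each quartile by smoking-status cell.
res <- tapply(toy$bwt, list(toy$Quartiles, toy$smoke), median)
res
```

Swapping `median` for `mean`, `sd`, or `IQR` in the last call gives the other summaries requested in (l).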

(m) The boxplots also suggest the distribution of bwt is symmetric for both smokers and non-smokers in each quartile.

> ggplot(data = CLEANP, aes(x = smoke, y = bwt)) +
+   geom_boxplot() +
+   facet_grid(Quartiles ~ .) +
+   theme_bw()

[Figure: boxplots of bwt by smoke, one panel per BMI quartile.]

(n) There appears to be no association between birth weight and BMI.

> ggplot(data = CLEANP, aes(x = BMI, y = bwt)) +
+   geom_point() +
+   facet_grid(smoke ~ Quartiles) +
+   theme_bw()

[Figure: scatterplots of bwt versus BMI, faceted by smoke (rows) and BMI quartile (columns).]

(o) The mean, standard deviation, median, and IQR for gestation grouped according to BMI quartile and smoking status are computed in R Code 2.2. Based on the values, gestation in each quartile appears to be symmetric regardless of the mother’s smoking status.

R Code 2.2
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        mean)
            Non-Smoker   Smoker
[15.7,19.9]   282.8938 277.2132
(19.9,21.3]   279.0331 277.4649
(21.3,23.3]   277.4372 279.6636
(23.3,40.4]   280.4764 277.4412
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        sd)
            Non-Smoker   Smoker
[15.7,19.9]   14.57214 14.55330
(19.9,21.3]   14.57810 14.40082
(21.3,23.3]   20.08376 15.33890
(23.3,40.4]   15.48780 16.77727
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        median)
            Non-Smoker Smoker
[15.7,19.9]        283    279
(19.9,21.3]        281    280
(21.3,23.3]        279    279
(23.3,40.4]        281    277
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        IQR)
            Non-Smoker Smoker
[15.7,19.9]      14.25   16.0
(19.9,21.3]      16.00   14.5
(21.3,23.3]      15.00   17.0
(23.3,40.4]      16.50   14.0

The histograms and density plots confirm a symmetric distribution of gestation regardless of BMI quartile or mother’s smoking status.

> ggplot(data = CLEANP, aes(x = gestation)) +
+   geom_histogram(fill = "lightblue") +
+   theme_bw() +
+   facet_grid(smoke ~ Quartiles)
> ggplot(data = CLEANP, aes(x = gestation)) +
+   geom_density() +
+   theme_bw() +
+   facet_grid(smoke ~ Quartiles)

[Figure: histograms of gestation, faceted by smoke (rows) and BMI quartile (columns).]

[Figure: density curves, faceted by smoke (rows) and BMI quartile (columns).]

The boxplots also suggest the distribution of gestation is symmetric for both smokers and non-smokers in each quartile.

> ggplot(data = CLEANP, aes(x = smoke, y = gestation)) +
+   geom_boxplot() +
+   facet_grid(Quartiles ~ .) +
+   theme_bw()

[Figure: boxplots of gestation by smoke, one panel per BMI quartile.]

There appears to be no association between gestation and BMI.

> ggplot(data = CLEANP, aes(x = BMI, y = gestation)) +
+   geom_point() +
+   facet_grid(smoke ~ Quartiles) +
+   theme_bw()

[Figure: scatterplots of gestation versus BMI, faceted by smoke (rows) and BMI quartile (columns).]

(p) There seems to be less variability among the three model fits for smoking mothers versus the non-smoking mothers.

> library(MASS)
> lqsmod <- ...

8. ...

Solution:
(a)
> prop.table(T1, 2)
        pclass
survived       1st       2nd       3rd
     No  0.3808050 0.5703971 0.7447109
     Yes 0.6191950 0.4296029 0.2552891

(b) Women in third class who survived make up 8.0978% of all passengers, while men in first class who survived make up 4.66% of all passengers.

> T2 <- ...
> T2
, , survived = No

      sex
pclass female male
   1st      5  118
   2nd     12  146
   3rd    110  418

, , survived = Yes

      sex
pclass female male
   1st    139   61
   2nd     94   25
   3rd    106   75

> prop.table(T2)
, , survived = No

      sex
pclass      female        male
   1st 0.003819710 0.090145149
   2nd 0.009167303 0.111535523
   3rd 0.084033613 0.319327731

, , survived = Yes

      sex
pclass      female        male
   1st 0.106187930 0.046600458
   2nd 0.071810542 0.019098549
   3rd 0.080977846 0.057295646

(c) The distribution of age is bimodal and skewed to the right. The median is 28, and the IQR is 18.

> median(TITANIC3$age, na.rm = TRUE)
[1] 28
> IQR(TITANIC3$age, na.rm = TRUE)
[1] 18
> ggplot(data = TITANIC3, aes(x = age)) +
+   geom_density(fill = "pink") +
+   theme_bw()

[Figure: density curve of age.]
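Returning to part (b): prop.table(T2) with no margin divides every cell by the grand total, so its entries are shares of all passengers. Survival rates within a class and sex need a margin argument. The sketch below rebuilds the counts printed for T2 as a plain array (the names `joint` and `cond` are hypothetical) and shows both kinds of proportion; it is an illustration, not the manual's code.

```r
# Counts copied from the T2 table in part (b).
# Arrays fill with pclass varying fastest, then sex, then survived.
T2 <- array(
  c(5, 12, 110, 118, 146, 418,     # survived = No:  female, then male
    139, 94, 106, 61, 25, 75),     # survived = Yes: female, then male
  dim = c(3, 2, 2),
  dimnames = list(pclass   = c("1st", "2nd", "3rd"),
                  sex      = c("female", "male"),
                  survived = c("No", "Yes"))
)

joint <- prop.table(T2)                     # shares of all passengers
cond  <- prop.table(T2, margin = c(1, 2))   # rates within class and sex

joint["3rd", "female", "Yes"]   # third-class women survivors / everyone
cond["3rd", "female", "Yes"]    # survival rate among third-class women
cond["1st", "male", "Yes"]      # survival rate among first-class men
```

With `margin = c(1, 2)`, the "No" and "Yes" entries for each class and sex combination sum to 1, which is what a survival rate requires.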

(d) Without considering pclass, the mean and median ages of surviving females were higher than the mean and median ages of surviving males. When pclass is taken into account, the mean age of surviving females is greater than the mean age of surviving males except in third class, while the median age of surviving females is greater than the median age of surviving males only in second class.

> with(data = TITANIC3, tapply(age, list(survived, sex), mean,
+                              na.rm = TRUE))
      female     male
No  25.25521 31.51641
Yes 29.81535 26.97778
> with(data = TITANIC3, tapply(age, list(survived, sex), sd,
+                              na.rm = TRUE))
      female     male
No  13.47688 13.79635
Yes 14.76928 15.55388
> with(data = TITANIC3, tapply(age, list(survived, sex), median,
+                              na.rm = TRUE))
    female male
No    24.5   29
Yes   28.5   27
> with(data = TITANIC3, tapply(age, list(survived, sex), IQR,
+                              na.rm = TRUE))
    female male
No   13.25 18.0
Yes  19.00 16.5
> with(data = TITANIC3, tapply(age, list(pclass, survived,
+                              sex), mean, na.rm = TRUE))
, , female

          No      Yes
1st 35.20000 37.10938
2nd 34.09091 26.71105
3rd 23.41875 20.81482

, , male

          No      Yes
1st 43.65816 36.16824
2nd 33.09259 17.44927
3rd 26.67960 22.43644

> with(data = TITANIC3, tapply(age, list(pclass, survived,
+                              sex), sd, na.rm = TRUE))
, , female

          No      Yes
1st 23.44568 13.93813
2nd 14.05315 12.62080
3rd 12.04303 12.32179

, , male

          No      Yes
1st 13.66284 15.09160
2nd 12.13161 16.70854
3rd 11.75896 10.70842

> with(data = TITANIC3, tapply(age, list(pclass, survived,
+                              sex), median, na.rm = TRUE))
, , female

      No  Yes
1st 36.0 35.5
2nd 29.0 27.5
3rd 22.5 22.0

, , male

    No Yes
1st 45  36
2nd 30  19
3rd 25  25

> with(data = TITANIC3, tapply(age, list(pclass, survived,
+                              sex), IQR, na.rm = TRUE))
, , female

        No   Yes
1st 25.000 24.00
2nd 16.000 14.25
3rd 13.125 12.00

, , male

        No  Yes
1st 22.125 21.0
2nd 16.000 27.5
3rd 13.000 10.5

(e) Both the mean and median ages of males who survived were lower than the mean and median ages of males who did not survive, except in third class, where surviving and non-surviving males had the same median age.

(f) The youngest female in first class to survive was 14 years old.

> with(data = TITANIC3,
+      sort(age[sex == "female" & survived == "Yes" & pclass == "1st"]))[1]
[1] 14

(g) Answers will vary.

9. Use the CARS2004 data frame from the PASWR2 package, which contains the numbers of cars per 1000 inhabitants (cars), the total number of known mortal accidents (deaths), and the country population/1000 (population) for the 25 member countries of the European Union for the year 2004.
(a) Compute the total number of cars per 1000 inhabitants in each country, and store the result in an object named total.cars. Determine the total number of known automobile fatalities in 2004 divided by the total number of cars for each country and store the result in an object named death.rate.


(b) Create a barplot showing the automobile death rate for each of the European Union member countries. Make the bars increase in magnitude so that the countries with the smallest automobile death rates appear first.

(c) Which country has the lowest automobile death rate? Which country has the highest automobile death rate?

(d) Create a scatterplot of population versus total.cars. How would you characterize the relationship?

(e) Find the least squares estimates for regressing population on total.cars. Superimpose the least squares line on the scatterplot from (d). What population does the least squares model predict for a country with a total.cars value of 19224.630? Find the difference between the population predicted from the least squares model and the actual population for the country with a total.cars value of 19224.630.

(f) Create a scatterplot of total.cars versus death.rate. How would you characterize the relationship between the two variables?

(g) Compute Spearman's rank correlation coefficient of total.cars and death.rate. (Hint: Use cor(x, y, method="spearman").) What is this coefficient measuring?

(h) Plot the logarithm of total.cars versus the logarithm of death.rate. How would you characterize the relationship?

(i) What are the least squares estimates for the regression of log(total.cars) on log(death.rate)? Superimpose the least squares line on the scatterplot from (h). What total number of cars does the least squares model predict for a country with a log(death.rate) value of -3.769252? Make sure you express your answer in the same units as those used for total.cars.
Solution:

(a)
> CARS2004 <- within(data = CARS2004, expr = {
+     total.cars <- cars * population / 1000
+     death.rate <- deaths / total.cars
+ })
> head(CARS2004)
         country cars deaths population death.rate total.cars
1        Belgium  467    112      10396 0.02306932   4854.932
2 Czech Republic  373    135      10212 0.03544167   3809.076
3        Denmark  354     68       5398 0.03558548   1910.892
4        Germany  546     71      82532 0.00157559  45062.472
5        Estonia  350    126       1351 0.26646928    472.850
6         Greece  348    147      11041 0.03825865   3842.268

(b)
> ggplot(data = CARS2004, aes(x = reorder(country, death.rate),
+                             y = death.rate)) +
+     geom_bar(stat = "identity", fill = "red") +
+     coord_flip() +
+     labs(x = "", y = "Death Rate",
+          title = "European 2004 Vehicular Death Rate") +
+     theme_bw()

[Figure: horizontal bar chart titled "European 2004 Vehicular Death Rate"; horizontal axis "Death Rate" from 0.0 to 0.5. Countries from highest to lowest death rate: Cyprus, Luxembourg, Latvia, Estonia, Lithuania, Malta, Slovenia, Slovakia, Ireland, Hungary, Greece, Denmark, Czech Republic, Finland, Austria, Belgium, Portugal, Sweden, Poland, Netherlands, Spain, France, Italy, United Kingdom, Germany.]

(c) The country with the lowest automobile death rate is Germany, while Cyprus has the highest automobile death rate.

(d) There is a positive curvilinear relationship between total.cars and population.

> ggplot(data = CARS2004, aes(x = total.cars, y = population)) +
+     geom_point() +
+     geom_smooth() +
+     theme_bw()

[Figure: scatterplot of population (0 to 75000) versus total.cars (0 to 40000) with a smoothed curve.]

(e)
> mod.lm <- lm(population ~ total.cars, data = CARS2004)
> summary(mod.lm)

Call:
lm(formula = population ~ total.cars, data = CARS2004)

Residuals:
   Min     1Q Median     3Q    Max
 -7500  -1840  -1013   1015  13510

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.124e+03  9.731e+02   2.183   0.0395 *
total.cars  1.881e+00  6.561e-02  28.668

> ggplot(data = CARS2004, aes(x = total.cars, y = population)) +
+     geom_point() +
+     geom_smooth(method = "lm") +
+     theme_bw()
> POP <- predict(mod.lm, newdata = data.frame(total.cars = 19224.630)) * 1000
> POP
       1
38285550
> resid(mod.lm)[7] * 1000
      7
4059450

[Figure: scatterplot of population (0 to 75000) versus total.cars (0 to 40000) with the least squares line superimposed.]

The least squares model predicts a population of approximately 38,285,550 people. Spain has a total.cars value of 19224.630 and a reported population of 42,345,000. The difference between Spain's actual population and the value predicted by least squares is the seventh residual, 42,345,000 − 38,285,550 = 4,059,450.

(f) There is a decreasing monotonic relationship between total.cars and death.rate.

> ggplot(data = CARS2004, aes(x = death.rate, y = total.cars)) +
+     geom_point() +
+     theme_bw()

[Figure: scatterplot of total.cars (0 to 40000) versus death.rate (0.0 to 0.5).]

(g) Spearman’s rank correlation is a measure of the monotonic relationship between two variables.

> with(data = CARS2004, cor(total.cars, death.rate, method = "spearman"))
[1] -0.9676923
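To make concrete what the coefficient in (g) measures, here is a minimal sketch using made-up vectors (the names x and y and their values are illustrative assumptions, not the CARS2004 data). It shows that Spearman's coefficient equals Pearson's correlation computed on ranks, so a perfectly monotone but nonlinear relationship gives exactly −1 even though Pearson's r does not.

```r
# Hypothetical data: y decreases monotonically, but not linearly, in x
x <- c(1, 2, 4, 7, 11, 16)
y <- 100 / x^2

pearson  <- cor(x, y)                       # strength of LINEAR association
spearman <- cor(x, y, method = "spearman")  # strength of MONOTONIC association
by.ranks <- cor(rank(x), rank(y))           # Pearson's r applied to the ranks

c(pearson = pearson, spearman = spearman, by.ranks = by.ranks)
```

The last two entries agree because the Spearman method only looks at the ordering of the observations.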

(h) The relationship is strong, negative, and linear between the logarithm of total.cars and the logarithm of death.rate.

> ggplot(data = CARS2004, aes(x = log(death.rate), y = log(total.cars))) +
+     geom_point() +
+     theme_bw()

[Figure: scatterplot of log(total.cars) (6 to 10) versus log(death.rate) (−6 to −2).]

(i) The total number of cars predicted for a country with log(death.rate) = -3.769252 is 4231.018 cars.

> ggplot(data = CARS2004, aes(x = log(death.rate), y = log(total.cars))) +
+     geom_point() +
+     theme_bw() +
+     geom_smooth(method = "lm")
> modlm.log <- lm(log(total.cars) ~ log(death.rate), data = CARS2004)
> coef(summary(modlm.log))
                  Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      5.0206666 0.19568324  25.65711 1.994256e-18
log(death.rate) -0.8833401 0.05142204 -17.17824 1.293676e-14
> TOTCARS <- exp(predict(modlm.log,
+     newdata = data.frame(death.rate = exp(-3.769252))))
> TOTCARS
       1
4231.018
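The back-transformation in (i) can also be checked by hand from the printed coefficient table. The sketch below hard-codes the two estimates reported by coef(summary(modlm.log)); the object names b0, b1, and total.cars.hat are illustrative assumptions.

```r
# Estimates copied from the coefficient table in part (i)
b0 <- 5.0206666    # intercept
b1 <- -0.8833401   # slope on log(death.rate)

log.death.rate <- -3.769252
log.total.cars <- b0 + b1 * log.death.rate  # prediction on the log scale
total.cars.hat <- exp(log.total.cars)       # back-transform to the units of total.cars
total.cars.hat
```

Because the model is fit to log(total.cars), the fitted value must be exponentiated before it can be compared with total.cars.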

[Figure: scatterplot of log(total.cars) (6 to 11) versus log(death.rate) (−6 to −2) with the least squares line superimposed.]

10. The data frame SURFACESPAIN in the PASWR2 package contains the surface area (km²) for seventeen autonomous Spanish communities.

(a) Use the function merge() to combine the data frames WHEATSPAIN (from Problem 3) and SURFACESPAIN into a new data frame named DataSpain.

(b) Create a variable named surface.h containing the surface area of each autonomous community in hectares. (Note: 100 hectares = 1 km².) Create a variable named wheat.p containing the percent surface area in each autonomous community dedicated to growing wheat. Add the newly created variables to the data frame DataSpain and store the result as a data frame with the name DataSpain.m.

(c) Assign the names of the autonomous communities as row names for DataSpain.m and remove the variable community from the data frame.

(d) Create a barplot showing the percent surface area dedicated to growing wheat for each of the seventeen Spanish autonomous communities. Arrange the communities by decreasing percentages.

(e) Display the percent surface area dedicated to growing wheat for each of the seventeen Spanish autonomous communities using the function dotchart(). To read about dotchart(), type ?dotchart at the command prompt. Do you prefer the barchart or the dotchart? Explain your answer.

(f) Describe the relationship between the surface area in an autonomous community dedicated to growing wheat (hectares) and the total surface area of the autonomous community (surface.h).

(g) Describe the relationship between the surface area in an autonomous community dedicated to growing wheat (hectares) and the percent of surface area dedicated to growing wheat out of the community's total surface area (wheat.p).


(h) Develop a model to predict the surface area in an autonomous community dedicated to growing wheat (hectares) based on the total surface area of the autonomous community (surface.h).

Solution:

(a)
> DataSpain <- merge(WHEATSPAIN, SURFACESPAIN)
> head(DataSpain)
     community hectares     acres               cuts surface
1    Andalucia   558292 1379569.6 (3.6e+05,1.55e+06]   87268
2       Aragon   311479  769681.4 (3.6e+05,1.55e+06]   47719
3     Asturias       65     160.6          (0,1e+05]   10604
4     Baleares     7203   17799.0          (0,1e+05]    4992
5 C.Valenciana     6111   15100.6          (0,1e+05]   23255
6     Canarias      100     247.1          (0,1e+05]    7447

(b)
> DataSpain.m <- within(data = DataSpain, expr = {
+     surface.h <- surface * 100
+     wheat.p <- 100 * hectares / surface.h
+ })
> DataSpain.m[1:6, c(1, 2, 3, 6, 7)]
     community hectares     acres     wheat.p surface.h
1    Andalucia   558292 1379569.6 6.397442361   8726800
2       Aragon   311479  769681.4 6.527358075   4771900
3     Asturias       65     160.6 0.006129762   1060400
4     Baleares     7203   17799.0 1.442908654    499200
5 C.Valenciana     6111   15100.6 0.262782197   2325500
6     Canarias      100     247.1 0.013428226    744700

(c)

(e)
> with(data = DataSpain.m[order(DataSpain.m$wheat.p), ],
+      dotchart(wheat.p, pch = 19,
+               labels = row.names(DataSpain.m[order(DataSpain.m$wheat.p), ]),
+               xlab = "Percent of Community Surface Area Dedicated to Growing Wheat"))

[Figure: two dot charts with horizontal axis "Percent of Community Surface Area Dedicated to Growing Wheat". Communities from highest to lowest percentage: La Rioja, Castilla-Leon, Aragon, Andalucia, Navarra, P.Vasco, Extremadura, Castilla-La Mancha, Cataluna, Madrid, Baleares, Murcia, Galicia, C.Valenciana, Cantabria, Canarias, Asturias.]

(f) There is a positive linear relationship between surface.h and hectares.

> ggplot(data = DataSpain.m, aes(x = surface.h, y = hectares)) +
+     geom_point() +
+     labs(y = "Surface Area Dedicated to Growing Wheat",
+          x = "Total Surface Area of Community") +
+     theme_bw()

[Figure: scatterplot of "Surface Area Dedicated to Growing Wheat" (0e+00 to 6e+05) versus "Total Surface Area of Community" (2500000 to 7500000).]

(g) There is a weak positive association between wheat.p (percent of surface area dedicated to growing wheat) and hectares (wheat growing area). The points fan out in a funnel shape as the percent of surface area dedicated to growing wheat increases.


> ggplot(data = DataSpain.m, aes(x = wheat.p, y = hectares)) +
+     geom_point() +
+     labs(y = "Surface Area Dedicated to Growing Wheat",
+          x = "Percent Surface Area Dedicated to Growing Wheat") +
+     theme_bw()

[Figure: scatterplot of "Surface Area Dedicated to Growing Wheat" (0e+00 to 6e+05) versus "Percent Surface Area Dedicated to Growing Wheat" (0 to 6).]

(h) The least squares line from regressing hectares on surface.h is $\widehat{\text{hectares}} = -52526.918 + 0.0602 \times \text{surface.h}$.

> model <- lm(hectares ~ surface.h, data = DataSpain.m)
> coef(summary(model))
                 Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) -5.252692e+04 2.597808e+04 -2.021971 6.139094e-02
surface.h    6.021268e-02 6.198023e-03  9.714820 7.301776e-08


Chapter 3 General Probability and Random Variables

1. How many ways can a host randomly choose 8 people out of 90 in the audience to participate in a TV game show?

Solution:
> choose(90, 8)
[1] 77515521435
There are $\binom{90}{8} = 77{,}515{,}521{,}435$ ways to choose 8 people from 90.

2. How many different six-place license plates are possible if the first two places are letters and the remaining places are numbers?

Solution:
> 26 * 26 * 10 * 10 * 10 * 10
[1] 6760000
There are a total of 6760000 possible license plates.

3. How many different six-place license plates are possible (first two places letters, remaining places numbers) if repetition among letters and numbers is not permissible?

Solution:
> 26 * 25 * 10 * 9 * 8 * 7
[1] 3276000
With two letter places and four number places, there are a total of 3276000 possible license plates if repetition among letters and numbers is not permissible.

4. Susie has 25 books she would like to arrange on her desk. Of the 25 books, 7 are statistics books, 6 are biology books, 5 are English books, 4 are history books, and 3 are psychology books. If Susie arranges her books by subject, how many ways can she arrange her books?

Solution:


> factorial(5) * factorial(7) * factorial(6) * factorial(5) *
+     factorial(4) * factorial(3)
[1] 7.52468e+12
Susie can arrange her books in 7,524,679,680,000 different ways.

5. A hat contains 20 consecutive numbers (1 to 20). If four numbers are drawn at random, how many ways are there for the largest number to be a 16 and the smallest number to be a 5?

Solution:
> choose(10, 2)
[1] 45
There is one way for the smallest number to be a 5 and the largest number to be a 16. This leaves two numbers to be drawn from 6 through 15. There are 10 numbers between 6 and 15 inclusive, so the remaining two numbers can be selected in $\binom{10}{2} = 45$ different ways.

6. A university committee of size 10, consisting of 2 faculty from the college of fine and applied arts, 2 faculty from the college of business, 3 faculty from the college of arts and sciences, and 3 administrators, is to be selected from 6 fine and applied arts faculty, 7 college of business faculty, 10 college of arts and sciences faculty, and 5 administrators. How many committees are possible?

Solution:
> TN <- choose(6, 2) * choose(7, 2) * choose(10, 3) * choose(5, 3)
> TN
[1] 378000
There are a total of 378000 possible committees.

7. How many different letter arrangements can be made from the letters BIOLOGY, PROBABILITY, and STATISTICS, respectively?

Solution:
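A minimal sketch of one way to count the arrangements asked for in Problem 7, assuming the usual rule for permutations with repeated letters, n!/(n1! n2! ... nk!). The helper name arrangements is hypothetical and is not taken from the text.

```r
# Hypothetical helper: number of distinct arrangements of the letters in a word,
# n! / (n1! * n2! * ... * nk!), where the ni are the letter multiplicities
arrangements <- function(word) {
  letters.in.word <- strsplit(word, "")[[1]]
  counts <- as.vector(table(letters.in.word))
  factorial(length(letters.in.word)) / prod(factorial(counts))
}

sapply(c("BIOLOGY", "PROBABILITY", "STATISTICS"), arrangements)
```

BIOLOGY has one repeated letter (O twice), PROBABILITY has two (B and I twice each), and STATISTICS has three (S and T three times each, I twice), which is what the denominators account for.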

> choose(12, 4) * choose(8, 5)
[1] 27720
The 12 rooms can be painted in a total of 27720 possible ways.

9. A shipment of 50 laptops includes 3 that are defective. If an instructor purchases 4 laptops from the shipment to use in his class, how many ways are there for the instructor to purchase at least 2 of the defective laptops?

Solution:
> choose(3, 2) * choose(47, 2) + choose(3, 3) * choose(47, 1)
[1] 3290

10. A multiple-choice test consists of 10 questions. Each question has 5 answers (only one is correct). How many different ways can a student fill out the test?

Solution:
> 5^10
[1] 9765625

11. How many ways can five politicians stand in line? In how many ways can they stand in line if two of the politicians refuse to stand next to each other?

Solution:
> factorial(5)
[1] 120
> factorial(5) - 2 * factorial(4)
[1] 72
There are 120 ways five politicians can stand in line. If two of the politicians refuse to stand next to each other, there are 72 ways they may stand in line.
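The counts in Problem 11 can be confirmed by brute force. The sketch below enumerates every line-up with a small recursive helper (perms is a hypothetical name, written here only for illustration) and counts those in which politicians 1 and 2 are not neighbors.

```r
# Hypothetical helper: all permutations of a vector, one per row
perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) cbind(v[i], perms(v[-i]))))
}

lineups <- perms(1:5)   # politicians labeled 1 to 5
# Politicians 1 and 2 stand together when their positions differ by one
adjacent <- apply(lineups, 1, function(p) abs(which(p == 1) - which(p == 2)) == 1)

n.lineups <- nrow(lineups)   # all line-ups
n.apart   <- sum(!adjacent)  # line-ups keeping politicians 1 and 2 apart
c(n.lineups, n.apart)
```

The number of excluded line-ups, sum(adjacent), matches the 2 * factorial(4) term subtracted in the solution.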


12. There are five different colored jerseys worn throughout the Tour de France. The yellow jersey is worn by the rider with the least accumulated time; the green jersey is worn by the best sprinter; the red and white polka dot jersey is worn by the best climber. The white jersey is worn by the best young rider, and the red jersey is worn by the rider with the most accumulated time still in the race. If 150 riders finish the Tour, how many different ways can the yellow, green, and red and white polka dot jerseys be awarded if (a) a rider can receive any number of jerseys and (b) each rider can receive at most one jersey?

Solution:
> 150^3              # part a
[1] 3375000
> 150 * 149 * 148    # part b
[1] 3307800

13. A president, treasurer, and secretary, all different, are to be chosen from among the 10 active members of a university club. How many different choices are possible if

(a) There are no restrictions.
(b) A will serve only if she is the treasurer.
(c) B and C will not serve together.
(d) D and E will serve together or not at all.
(e) F must be an officer.

Solution:

(a)
> 10 * 9 * 8
[1] 720

(b)
> 9 * 8 * 7 + 9 * 8
[1] 576

(c)
> 8 * 7 * 6 + 3 * 2 * 8 * 7
[1] 672

(d)
> 8 * 7 * 6 + 3 * 2 * 8
[1] 384

(e)
> 3 * 9 * 8
[1] 216
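As a cross-check on Problem 13, the five counts can be reproduced by listing every slate of officers and filtering. This is only a sketch: the member labels A through J and the helper serves are illustrative assumptions.

```r
# Ten members; only A-F are named in the problem, G-J are the remaining members
members <- LETTERS[1:10]
slates <- expand.grid(president = members, treasurer = members,
                      secretary = members, stringsAsFactors = FALSE)
# Keep slates whose three officers are all different people
distinct <- slates$president != slates$treasurer &
  slates$president != slates$secretary &
  slates$treasurer != slates$secretary
slates <- slates[distinct, ]

# TRUE for each slate in which the given member holds some office
serves <- function(m) {
  slates$president == m | slates$treasurer == m | slates$secretary == m
}

counts <- c(a = nrow(slates),
            b = sum(!serves("A") | slates$treasurer == "A"),
            c = sum(!(serves("B") & serves("C"))),
            d = sum(serves("D") == serves("E")),
            e = sum(serves("F")))
counts
```

Each filter is a direct translation of the restriction in words, which makes it easy to see which slates the arithmetic in the solution is counting.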

14. On a multiple-choice exam with three possible answers for each of the five questions, what is the probability that a student would get four or more correct answers just by guessing?

Solution:
> choose(5, 4) * (1/3)^4 * (2/3)^1 + choose(5, 5) * (1/3)^5 * (2/3)^0
[1] 0.04526749
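The sum in Problem 14 is a binomial tail probability, so the same value can be obtained with R's built-in binomial functions. A short sketch (the object names are illustrative):

```r
# X = number of correct guesses among 5 questions, each correct with probability 1/3
p.sum  <- sum(dbinom(4:5, size = 5, prob = 1/3))              # P(X = 4) + P(X = 5)
p.tail <- pbinom(3, size = 5, prob = 1/3, lower.tail = FALSE) # P(X > 3), the same event
c(p.sum, p.tail)
```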

15. Suppose four balls are chosen at random without replacement from an urn containing six black balls and four red balls. What is the probability of selecting two balls of each color?

Solution:
> choose(4, 2) * choose(6, 2) / choose(10, 4)
[1] 0.4285714

16. What is the probability that a hand of five cards chosen randomly and without replacement from a standard deck of 52 cards contains the ace of hearts, exactly one other ace, and exactly two kings?

Solution:
> choose(1, 1) * choose(3, 1) * choose(4, 2) * choose(44, 1) / choose(52, 5)
[1] 0.0003047373

17. In the New York State lottery game, six of the numbers 1 through 54 are chosen by a customer. Then, in a televised drawing, six of these numbers are selected. If all six of a customer's numbers are selected, then that customer wins a share of the first prize. If five or four of the numbers are selected, the customer wins a share of the second or the third prize. What is the probability that any customer will win a share of the first prize, the second prize, and the third prize, respectively?

Solution:
> choose(6, 6) / choose(54, 6)                    # Pr(first prize)
[1] 3.871892e-08
> choose(6, 5) * choose(48, 1) / choose(54, 6)    # Pr(second prize)
[1] 1.115105e-05
> choose(6, 4) * choose(48, 2) / choose(54, 6)    # Pr(third prize)
[1] 0.0006551242
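The three lottery probabilities in Problem 17 are hypergeometric, so they can be computed in one call to dhyper(). In this sketch the customer's six numbers play the role of the "successes" in the population of 54; the name prizes is illustrative.

```r
# X = number of the customer's 6 numbers among the 6 drawn from 54
# dhyper(x, m, n, k): m = customer's numbers, n = all other numbers, k = numbers drawn
prizes <- setNames(dhyper(6:4, m = 6, n = 48, k = 6),
                   c("first", "second", "third"))
prizes
```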

18. An office supply store is selling packages of 100 CDs at a very affordable price. However, roughly 10% of all packages are defective. If a package of 100 CDs containing exactly 10 defective CDs is purchased, find the probability that exactly 2 of the first 5 CDs used are defective.

Solution:
> (choose(10, 2) * choose(90, 3)) / choose(100, 5)
[1] 0.07021881

19. A box contains six marbles, two of which are black. Three are drawn with replacement. What is the probability two of the three are black?

Solution:
> library(MASS)    # provides fractions()
> fractions(choose(3, 2) * (1/3)^2 * (2/3)^1)
[1] 2/9

20. The ASU triathlon club consists of 11 women and 7 men. What is the probability of selecting a committee of size four with exactly three women?

Solution:
> choose(11, 3) * choose(7, 1) / choose(18, 4)
[1] 0.377451
> # or
> (11/18) * (10/17) * (9/16) * (7/15) * 4
[1] 0.377451

21. Four golf balls are to be placed in six different containers. One ball is red; one, green; one, blue; and one, yellow.

(a) In how many ways can the four golf balls be placed into six different containers? Assume that any container can contain any number of golf balls (as long as there are a total of four golf balls).
(b) In how many ways can the golf balls be placed if container one remains empty?
(c) In how many ways can the golf balls be placed if no two golf balls can go into the same container?
(d) What is the probability that no two golf balls are in the same container, assuming that the balls are randomly tossed into the containers?

Solution:

(a) $6^4 = 1296$
(b) $5^4 = 625$
(c) $6 \cdot 5 \cdot 4 \cdot 3 = 360$
(d) $\dfrac{6 \cdot 5 \cdot 4 \cdot 3}{6^4} = \dfrac{360}{1296} = \dfrac{5}{18}$
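The answers to Problem 21 can be verified by enumerating all placements of the four colored balls. The sketch below is one way to do so; the object names are illustrative.

```r
# Each row is one placement: the container (1 to 6) receiving each colored ball
placements <- expand.grid(red = 1:6, green = 1:6, blue = 1:6, yellow = 1:6)

n.all       <- nrow(placements)                                          # part (a)
n.empty.one <- sum(apply(placements != 1, 1, all))                       # part (b)
n.distinct  <- sum(apply(placements, 1, function(p) !anyDuplicated(p)))  # part (c)
p.distinct  <- n.distinct / n.all                                        # part (d)

c(n.all, n.empty.one, n.distinct, p.distinct)
```

Part (d) treats every row as equally likely, which is what "randomly tossed" is taken to mean in the solution.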

22. Three dice are thrown. What fraction of the time does a sum of 9 appear on the faces? What percent of the time does a sum of 10 appear?

Solution:
> SS <- expand.grid(die1 = 1:6, die2 = 1:6, die3 = 1:6)
> SUM <- apply(SS, 1, sum)
> fractions(mean(SUM == 9))
[1] 25/216
> mean(SUM == 10) * 100
[1] 12.5
A sum of 9 appears 25/216 (about 11.57%) of the time. A sum of 10 appears 12.5% of the time.

23. Assume that P(A) = 0.5, P(A ∩ C) = 0.2, P(C) = 0.4, P(B) = 0.4, P(A ∩ B ∩ C) = 0.1, P(B ∩ C) = 0.2, and P(A ∩ B) = 0.2. Calculate the following probabilities:

(a) $P(A \cup B \cup C)$
(b) $P(A^c \cap (B \cup C))$
(c) $P((B \cap C)^c \cup (A \cap B)^c)$
(d) $P(A) - P(A \cap C)$

Solution: (a) 0.8, (b) 0.3, (c) 0.9, (d) 0.3

24. In a 10k race where three runners, Susie, Mike, and Anna, enter the race with identical personal best times, assume they all have an equal chance of winning today's 10k. Consider the events:

• E1: Susie wins the 10k.


• E2: Susie places second in the 10k.
• E3: Susie places third in the 10k.
• W: Susie places higher than Mike.

Is W independent of E1, E2, and E3?

Solution: The sample space of outcomes is Ω = {SAM, SMA, AMS, ASM, MSA, MAS}, where each person in each place is represented by their name's first initial. Since

\[
P(W) = \frac{3}{6} = \frac{1}{2}, \qquad
P(E_1) = \frac{2}{6} = \frac{1}{3}, \qquad
P(E_2) = \frac{2}{6} = \frac{1}{3}, \qquad
P(E_3) = \frac{2}{6} = \frac{1}{3},
\]

and

\[
\begin{aligned}
P(W \mid E_1) &= \frac{P(W \cap E_1)}{P(E_1)} = \frac{1/3}{1/3} = 1, \\
P(W \mid E_2) &= \frac{P(W \cap E_2)}{P(E_2)} = \frac{1/6}{1/3} = \frac{1}{2}, \\
P(W \mid E_3) &= \frac{P(W \cap E_3)}{P(E_3)} = \frac{0}{1/3} = 0,
\end{aligned}
\]

we can say W is independent of E2 (because P(W | E2) = P(W) = 1/2) but not of E1 or of E3.

25. Verify that P(F | E) satisfies the three axioms of probability.

Solution: It must be shown that $P(F \mid E) = \dfrac{P(F \cap E)}{P(E)}$ satisfies

(1) $0 \le P(F \mid E) \le 1$
(2) $P(\Omega \mid E) = 1$
(3) $P\left(\bigcup_{i=1}^{\infty} F_i \mid E\right) = \sum_{i=1}^{\infty} P(F_i \mid E)$ for mutually exclusive events $F_1, F_2, \ldots$

(1) The left inequality holds because P(F | E) is a ratio of a nonnegative probability to a positive one. Since $F \cap E \subset E$, it follows that $P(F \cap E) \le P(E)$, so $P(F \mid E) = P(F \cap E)/P(E) \le 1$.

(2) $P(\Omega \mid E) = \dfrac{P(\Omega \cap E)}{P(E)} = \dfrac{P(E)}{P(E)} = 1$

(3) Because the events $F_i \cap E$ are mutually exclusive whenever the $F_i$ are,

\[
\begin{aligned}
P\left(\bigcup_{i=1}^{\infty} F_i \mid E\right)
&= \frac{P\left(\bigcup_{i=1}^{\infty} (F_i \cap E)\right)}{P(E)} \\
&= \sum_{i=1}^{\infty} \frac{P(F_i \cap E)}{P(E)} \\
&= \sum_{i=1}^{\infty} P(F_i \mid E).
\end{aligned}
\]

26. If A and B are independent events, show that $A^c$ and $B^c$ are also independent events.

Solution: If A and B are independent, then P(A ∩ B) = P(A)P(B).

\[
\begin{aligned}
P(A^c \cap B^c) &= P((A \cup B)^c) \\
&= 1 - P(A \cup B) \\
&= 1 - P(A) - P(B) + P(A \cap B) \\
&= 1 - P(A) - P(B) + P(A)P(B) \\
&= 1 - P(A)(1 - P(B)) - P(B) \\
&= 1 - P(B) - P(A)(1 - P(B)) \\
&= (1 - P(B))(1 - P(A)) \\
&= P(A^c)P(B^c)
\end{aligned}
\]

Consequently, $A^c$ and $B^c$ are also independent.

27. Let A and B be events where 0 < P(A) < 1 and 0 < P(B) < 1. Is $P(A \mid B) + P(A^c \mid B^c) = 1$ true when A and B are (a) mutually exclusive? (b) independent? If $P(A \mid B) + P(A^c \mid B^c) = 1$ is not true for either (a) or (b), provide a counterexample.

Solution:

(a) If A and B are mutually exclusive, $P(A \mid B) + P(A^c \mid B^c) \ne 1$ in general. Counterexample: Consider rolling a fair die and let the event A be rolling an even number and the event B be rolling an odd number. Then, $P(A \mid B) + P(A^c \mid B^c) = 0 + 0 = 0 \ne 1$.

(b) If A and B are independent, then $P(A \mid B) + P(A^c \mid B^c) = 1$ because $P(A \mid B) + P(A^c \mid B^c) = P(A) + P(A^c) = 1$. Recall that if A and B are independent, $A^c$ and $B^c$ are also independent; so $P(A \mid B) = P(A)$ and $P(A^c \mid B^c) = P(A^c)$.

28. A family has three cars, all with electric windows. Car A's windows always work. Car B's windows work 30% of the time, and Car C's windows work 75% of the time. The family uses Car A 2/3 of the time; Car B, 2/9 of the time; and Car C, the remaining fraction.

(a) On a particularly hot day, when the family wants to roll the windows down, compute the probability the windows will work.
(b) If the electric windows work, find the probability the family is driving Car C.

Solution: Let A, B, and C be the events of using the cars A, B, and C, respectively, and let T be the event that the windows work properly.

(a)
\[
P(T) = P(T \mid A)P(A) + P(T \mid B)P(B) + P(T \mid C)P(C)
= \frac{2}{3} \times 1 + \frac{2}{9} \times 0.3 + \frac{1}{9} \times 0.75 = 0.8167
\]

(b) Applying Bayes' formula,
\[
P(C \mid T) = \frac{P(T \mid C)P(C)}{P(T)} = \frac{\frac{1}{9} \times 0.75}{0.8167} = 0.102.
\]

29. A new drug test being considered by the International Olympic Committee can detect the presence of a banned substance when it has been taken by the subject in the last 90 days 98% of the time. However, the test also registers a "false positive" in 2% of the population that has never taken the banned substance. If 2% of the athletes in question are taking the banned substance, what is the probability a person that has a positive drug test is actually taking the banned substance?

Solution: Let B = banned substance is present and + = a positive test.

\[
P(+ \mid B) = 0.98, \quad P(+ \mid B^c) = 0.02, \quad P(B) = 0.02, \quad P(B^c) = 0.98
\]

\[
\begin{aligned}
P(B \mid +) &= \frac{P(B \cap +)}{P(+)} \\
&= \frac{P(+ \mid B) \cdot P(B)}{P(+ \mid B) \cdot P(B) + P(+ \mid B^c) \cdot P(B^c)} \\
&= \frac{0.98 \times 0.02}{0.98 \times 0.02 + 0.02 \times 0.98} = 0.50
\end{aligned}
\]

30. The products of an agricultural firm are delivered by four different transportation companies, A, B, C, and D. Company A transports 40% of the products; company B, 30%; company C, 20%; and, finally, company D, 10%. During transportation, 5%, 4%, 2%, and 1% of the products spoil with companies A, B, C, and D, respectively. If one product is randomly selected,

(a) Obtain the probability that it is spoiled.
(b) If the chosen product is spoiled, derive the probability that it has been transported by company A.

Solution: Let A, B, C, and D be the events associated with a company transporting a product and S be the event that the product is spoiled. P(A) = 0.40, P(B) = 0.30, P(C) = 0.20, and P(D) = 0.10. Also, P(S|A) = 0.05, P(S|B) = 0.04, P(S|C) = 0.02, and P(S|D) = 0.01.

(a)
\[
\begin{aligned}
P(S) &= P(A)P(S \mid A) + P(B)P(S \mid B) + P(C)P(S \mid C) + P(D)P(S \mid D) \\
&= (0.4)(0.05) + (0.3)(0.04) + (0.2)(0.02) + (0.1)(0.01) = 0.037
\end{aligned}
\]

(b)
\[
P(A \mid S) = \frac{P(A \cap S)}{P(S)} = \frac{P(S \mid A) \cdot P(A)}{P(S)} = \frac{(0.05)(0.4)}{0.037} = 0.5405
\]
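The total probability and Bayes' formula steps of Problem 30 vectorize naturally in R. In this sketch the object names prior, likelihood, p.spoiled, and posterior are illustrative; the numbers are those given in the problem.

```r
prior      <- c(A = 0.40, B = 0.30, C = 0.20, D = 0.10)  # P(company)
likelihood <- c(A = 0.05, B = 0.04, C = 0.02, D = 0.01)  # P(S | company)

p.spoiled <- sum(prior * likelihood)          # law of total probability, part (a)
posterior <- prior * likelihood / p.spoiled   # Bayes' formula, P(company | S)

p.spoiled
round(posterior, 4)                           # the entry for A answers part (b)
```

The same pattern (a vector of priors, a vector of conditional probabilities, and a normalizing sum) applies to the other Bayes' formula problems in this chapter.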

31. Two lots of large glass beads are available (A and B). Lot A has four beads, two of which are chipped; and lot B has five beads, two of which are chipped. Two beads are chosen at random from lot A and passed to lot B. Then, one bead is randomly selected from lot B. Find:

(a) The probability that the selected bead is chipped.
(b) The probability that the two beads selected from lot A were not chipped if the bead selected from lot B is not chipped.

Solution:

(a) Let B1 represent the lot obtained from passing two chipped beads from A to B. Let B2 represent the lot obtained from passing one chipped bead and one non-chipped bead from A to B. Let B3 represent the lot obtained from passing two non-chipped beads from A to B. Let C be the event of selecting a chipped bead.

\[
P(B_1) = \frac{\binom{2}{2}\binom{2}{0}}{\binom{4}{2}} = \frac{1}{6}; \qquad
P(B_2) = \frac{\binom{2}{1}\binom{2}{1}}{\binom{4}{2}} = \frac{4}{6}; \qquad
P(B_3) = \frac{\binom{2}{0}\binom{2}{2}}{\binom{4}{2}} = \frac{1}{6}
\]

\[
\begin{aligned}
P(C) &= P(C \mid B_1) \cdot P(B_1) + P(C \mid B_2) \cdot P(B_2) + P(C \mid B_3) \cdot P(B_3) \\
&= \frac{4}{7} \cdot \frac{1}{6} + \frac{3}{7} \cdot \frac{4}{6} + \frac{2}{7} \cdot \frac{1}{6} = \frac{18}{42} = \frac{3}{7}
\end{aligned}
\]

(b)
\[
P(B_3 \mid C^c) = \frac{P(C^c \mid B_3) \cdot P(B_3)}{P(C^c)} = \frac{\frac{5}{7} \cdot \frac{1}{6}}{\frac{4}{7}} = \frac{5}{24}
\]
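The two answers to Problem 31 can be checked numerically with dhyper(), which gives the distribution of the number of chipped beads moved from lot A. This is a sketch with illustrative object names.

```r
# Number of chipped beads among the two moved from lot A (2 chipped, 2 not chipped)
chipped.moved <- 0:2
p.moved <- dhyper(chipped.moved, m = 2, n = 2, k = 2)   # 1/6, 4/6, 1/6

# Lot B then holds 7 beads, 2 + chipped.moved of them chipped
p.chip.given <- (2 + chipped.moved) / 7
p.chip <- sum(p.chip.given * p.moved)                   # part (a)

# Part (b): both moved beads unchipped, given the bead drawn from lot B is unchipped
p.b3.given.not.chip <- (1 - p.chip.given[1]) * p.moved[1] / (1 - p.chip)

c(p.chip, p.b3.given.not.chip)
```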

32. A box contains 5 defective bulbs, 10 partially defective (they start to fail after 10 hours of use), and 25 perfect bulbs. If a bulb is tested and it does not fail immediately, find the probability that the bulb is perfect.

Solution: Let P = a perfect bulb and N = does not fail immediately.

\[
P(P \mid N) = \frac{P(P \cap N)}{P(N)} = \frac{25/40}{35/40} = \frac{5}{7}
\]

33. A salesman in a department store receives household appliances from three suppliers: I, II, and III. From previous experience, the salesman knows that 2%, 1%, and 3% of the appliances from suppliers I, II, and III, respectively, are defective. The salesman sells 35% of the appliances from supplier I, 25% from supplier II, and 40% from supplier III. If an appliance randomly selected is defective, find the probability that it comes from supplier III.

Solution: Let I, II, and III be events associated with suppliers and D be the event associated with a defective appliance.

\[
\begin{aligned}
P(D) &= P(D \mid I) \cdot P(I) + P(D \mid II) \cdot P(II) + P(D \mid III) \cdot P(III) \\
&= 0.02 \times 0.35 + 0.01 \times 0.25 + 0.03 \times 0.40 = 0.0215
\end{aligned}
\]

\[
P(III \mid D) = \frac{P(D \mid III) \cdot P(III)}{P(D)} = \frac{0.03 \times 0.4}{0.0215} = 0.5581
\]

34. Last year, a new business purchased 25 tablets, 25 laptops, and 50 desktops, all with three-year warranties. The probability a tablet has had warranty work is four times the probability a desktop has had warranty work. The probability a laptop has had warranty work is twice the probability a desktop has had warranty work. Given that 10 computers have used the warranty,

(a) If a computer is a laptop, what is the probability it has had warranty work?
(b) If a computer has had warranty work, what is the probability it was a laptop?
(c) If a computer has had warranty work, what is the probability it was a tablet?

Solution: Let T represent tablets, L represent laptops, D represent desktops, and W represent warranty use.

\[
\begin{aligned}
P(T) &= 0.25, & P(L) &= 0.25, & P(D) &= 0.50, \\
P(W \mid D) &= p, & P(W \mid L) &= 2p, & P(W \mid T) &= 4p, \\
P(W) &= 0.10
\end{aligned}
\]

(a) By the theorem of total probability,

\[
\begin{aligned}
P(W) &= P(W \mid T)P(T) + P(W \mid L)P(L) + P(W \mid D)P(D) \\
0.10 &= 4p \cdot 0.25 + 2p \cdot 0.25 + p \cdot 0.50 \\
0.10 &= 2p \implies p = 0.05.
\end{aligned}
\]

The probability a computer has had warranty work given that it is a laptop is P(W | L) = 2p = 0.10.

(b) The probability a computer is a laptop given that it has had warranty work is

\[
P(L \mid W) = \frac{P(W \mid L)P(L)}{P(W)} = \frac{0.10 \cdot 0.25}{0.10} = 0.25.
\]

(c) The probability a computer is a tablet given that it has had warranty work is

\[
P(T \mid W) = \frac{P(W \mid T)P(T)}{P(W)} = \frac{0.2 \cdot 0.25}{0.10} = 0.50.
\]

35. An urn contains 14 balls; 6 of them are white, and the others are black. Another urn contains 9 balls; 3 are white, and 6 are black. A ball is drawn at random from the first urn and is placed in the second urn. Then, a ball is drawn at random from the second urn. If this ball is white, find the probability that the ball drawn from the first urn was black.

Solution: Let W1 = first draw is white and W2 = second draw is white. Since there are only two colors of balls, $W_1^c$ = first draw is black and $W_2^c$ = second draw is black.

\[
\begin{aligned}
P(W_1^c \mid W_2) &= \frac{P(W_1^c \cap W_2)}{P(W_2)} \\
&= \frac{P(W_2 \mid W_1^c) \cdot P(W_1^c)}{P(W_2)} \\
&= \frac{P(W_2 \mid W_1^c) \cdot P(W_1^c)}{P(W_2 \mid W_1^c) \cdot P(W_1^c) + P(W_2 \mid W_1) \cdot P(W_1)} \\
&= \frac{\frac{3}{10} \cdot \frac{8}{14}}{\frac{3}{10} \cdot \frac{8}{14} + \frac{4}{10} \cdot \frac{6}{14}} \\
&= \frac{\frac{24}{140}}{\frac{24}{140} + \frac{24}{140}} = \frac{1}{2}
\end{aligned}
\]
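A quick Monte Carlo check of Problem 35 is sketched below. The seed, the number of replications, and the object names are arbitrary choices made for illustration; the estimate should land close to the exact answer of 1/2.

```r
set.seed(1)    # arbitrary seed, for reproducibility only
n <- 100000    # number of simulated two-stage draws

first.black  <- runif(n) < 8/14                 # ball moved from urn 1 is black
p.white2     <- ifelse(first.black, 3/10, 4/10) # P(white from urn 2 | moved color)
second.white <- runif(n) < p.white2             # ball drawn from urn 2 is white

# Relative frequency of "first was black" among trials where the second was white
p.hat <- mean(first.black[second.white])
p.hat
```

Conditioning is done simply by subsetting to the simulated trials in which the second ball is white.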

36. Before launching a new flavor of yogurt, a company conducted taste tests with four new flavors: lemon, strawberry, peach, and cherry. It obtained the following probabilities of a successful launch: P(lemon) = 2/10, P(strawberry) = 3/10, P(peach) = 4/10, and P(cherry) = 5/10. Let X be the random variable "number of successful flavors launched." Obtain its probability mass function.

Solution: Let L = lemon, S = strawberry, P = peach, and C = cherry launched successfully.

\[
\begin{aligned}
P(X = 4) &= P(L \cap S \cap P \cap C) = \frac{2}{10} \cdot \frac{3}{10} \cdot \frac{4}{10} \cdot \frac{5}{10} = 0.012 \\[4pt]
P(X = 3) &= P(L \cap S \cap P \cap C^c) + P(L \cap S \cap P^c \cap C) + P(L \cap S^c \cap P \cap C) + P(L^c \cap S \cap P \cap C) \\
&= \frac{2 \cdot 3 \cdot 4 \cdot 5}{10^4} + \frac{2 \cdot 3 \cdot 6 \cdot 5}{10^4} + \frac{2 \cdot 7 \cdot 4 \cdot 5}{10^4} + \frac{8 \cdot 3 \cdot 4 \cdot 5}{10^4} = 0.106 \\[4pt]
P(X = 2) &= P(L \cap S \cap P^c \cap C^c) + P(L \cap S^c \cap P^c \cap C) + P(L^c \cap S^c \cap P \cap C) \\
&\quad + P(L^c \cap S \cap P^c \cap C) + P(L^c \cap S \cap P \cap C^c) + P(L \cap S^c \cap P \cap C^c) \\
&= \frac{2 \cdot 3 \cdot 6 \cdot 5}{10^4} + \frac{2 \cdot 7 \cdot 6 \cdot 5}{10^4} + \frac{8 \cdot 7 \cdot 4 \cdot 5}{10^4} + \frac{8 \cdot 3 \cdot 6 \cdot 5}{10^4} + \frac{8 \cdot 3 \cdot 4 \cdot 5}{10^4} + \frac{2 \cdot 7 \cdot 4 \cdot 5}{10^4} = 0.32 \\[4pt]
P(X = 1) &= P(L \cap S^c \cap P^c \cap C^c) + P(L^c \cap S \cap P^c \cap C^c) + P(L^c \cap S^c \cap P \cap C^c) + P(L^c \cap S^c \cap P^c \cap C) \\
&= \frac{2 \cdot 7 \cdot 6 \cdot 5}{10^4} + \frac{8 \cdot 3 \cdot 6 \cdot 5}{10^4} + \frac{8 \cdot 7 \cdot 4 \cdot 5}{10^4} + \frac{8 \cdot 7 \cdot 6 \cdot 5}{10^4} = 0.394 \\[4pt]
P(X = 0) &= P(L^c \cap S^c \cap P^c \cap C^c) = \frac{8 \cdot 7 \cdot 6 \cdot 5}{10^4} = 0.168
\end{aligned}
\]

x       0      1      2      3      4
p(x)    0.168  0.394  0.320  0.106  0.012
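The probability mass function in Problem 36 can be reproduced by enumerating all 2^4 success and failure patterns. The sketch below assumes, as the solution does when it multiplies probabilities, that the four launches are independent; the object names are illustrative.

```r
p <- c(lemon = 0.2, strawberry = 0.3, peach = 0.4, cherry = 0.5)

# All 2^4 success (1) / failure (0) patterns, one per row
outcomes <- as.matrix(expand.grid(rep(list(0:1), 4)))
# Probability of each pattern under independence
prob <- apply(outcomes, 1, function(o) prod(ifelse(o == 1, p, 1 - p)))

# Add up the pattern probabilities by number of successes
pmf <- tapply(prob, rowSums(outcomes), sum)
pmf
```

Grouping by rowSums(outcomes) collects, for each value of X, exactly the terms written out by hand in the solution.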

37. John and Peter play a game with a coin such that P(head) = p. The game consists of tossing a coin twice. John wins if the same result is obtained in the two tosses, and Peter wins if the two results are different.

(a) At what value of p is neither of them favored by the game?
(b) If p is different from your answer in (a), who is favored?

Solution: $P(\text{John wins}) = p^2 + (1 - p)^2$ and $P(\text{Peter wins}) = 1 - P(\text{John wins})$.

(a) When P(John wins) = P(Peter wins) = 1/2, the game is fair.

\[
\begin{aligned}
P(\text{John wins}) = p^2 + (1 - p)^2 &= \frac{1}{2} \\
2p^2 + 2(1 - p)^2 &= 1 \\
4p^2 - 4p + 1 &= 0 \\
(2p - 1)^2 &= 0 \implies p = \frac{1}{2}
\end{aligned}
\]

If p = 1/2, both of them have the same probability of winning the game.

(b) Since $P(\text{John wins}) - \frac{1}{2} = \frac{(2p - 1)^2}{2} > 0$ for all $p \ne 1/2$, John is favored for any value of p other than the answer in (a).

38. A bank is going to place a security camera in the ceiling of a circular hall of radius r. What is the probability that the camera is placed nearer the center than the outside circumference if the camera is placed at random?

Solution: If the camera is to be placed nearer the center than the circumference, it must fall inside a circle of radius r/2, where r is the radius of the entire room. The probability the camera falls in this circle is

\[
\frac{\pi (r/2)^2}{\pi r^2} = \frac{1}{4}.
\]

39. Let the random variable X be the sum of the numbers on two fair dice. Find an upper bound on P(|X − 7| ≥ 4) using Chebyshev’s Inequality as well as the exact probability for P(|X − 7| ≥ 4).

Solution: The probability mass function is

x       2     3     4     5     6     7     8     9     10    11    12
p(x)    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36

which implies $\mu_X = \sum_x x \cdot p(x) = 7$ and $\sigma_X^2 = \sum_x (x - \mu_X)^2 \cdot p(x) = 5.83$. The bound given by Chebyshev’s Inequality says

$$\begin{aligned}
P(|X - \mu_X| \ge k) &\le \frac{\sigma_X^2}{k^2} \\
P(|X - 7| \ge 4) &\le \frac{5.83}{4^2} \implies P(|X - 7| \ge 4) \le 0.3645833
\end{aligned}$$
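The R code that accompanied this solution did not survive extraction. The following is a minimal sketch, not the authors' code, of how the Chebyshev bound and the exact probability could be computed; the object names `x`, `px`, `bound`, and `exact` are assumptions made for illustration.

```r
# Support and probability mass function for the sum of two fair dice
x  <- 2:12
px <- c(1:6, 5:1) / 36

mu     <- sum(x * px)            # mean of X
sigma2 <- sum((x - mu)^2 * px)   # variance of X

# Chebyshev bound: P(|X - mu| >= k) <= sigma^2 / k^2 with k = 4
bound <- sigma2 / 4^2

# Exact probability: add the pmf over the values with |x - 7| >= 4
exact <- sum(px[abs(x - 7) >= 4])

c(mean = mu, variance = sigma2, bound = bound, exact = exact)
```

The exact probability is P(X ≤ 3) + P(X ≥ 11) = 6/36, so the Chebyshev bound is valid but far from tight here.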

> > > >

dicerolls >

MX > > > > > > > > > > > > > >


par(pty = "s") x > >

library(MASS) SS >

x > > + + > >

set.seed(13) sims > > > > > > >

x > > > > > > > > > > > > > + > > > > > > > > > > >

opar >


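(b) $f(x) = kx^2$, 0 < x < 2.

$$\begin{aligned}
\int_0^2 kx^2\,dx &\overset{\text{set}}{=} 1 \\
\left.\frac{kx^3}{3}\right|_0^2 &= 1 \\
k \cdot \frac{8}{3} &= 1 \\
k &= \frac{3}{8}
\end{aligned}$$

So, $f(x) = \frac{3}{8}x^2$ and $F(x) = \int_0^x \frac{3}{8}y^2\,dy = \frac{x^3}{8}$.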

> g k k [1] 0.375 f > >
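The console transcript above lost its assignments during extraction. As a hedged sketch of the same check (the definitions of `g`, `k`, and `f` below are assumed reconstructions, and `Fx` is a name introduced only for this illustration), the normalizing constant can be found numerically with `integrate()`:

```r
# Kernel of the density without its constant
g <- function(x) x^2

# k is the reciprocal of the integral of the kernel over (0, 2)
k <- 1 / integrate(g, lower = 0, upper = 2)$value

# Density and cdf implied by that constant
f  <- function(x) k * x^2
Fx <- function(x) x^3 / 8

k
integrate(f, lower = 0, upper = 2)$value   # should be 1 for a valid pdf
```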


(c) $f(x) = k\sqrt{x}/2$, 0 < x < 1.

$$\begin{aligned}
\int_0^1 \frac{k\sqrt{x}}{2}\,dx &\overset{\text{set}}{=} 1 \\
\left.\frac{kx^{3/2}}{3}\right|_0^1 &= 1 \\
k \cdot \frac{1}{3} &= 1 \\
k &= 3
\end{aligned}$$

So, $f(x) = \frac{3\sqrt{x}}{2}$ and $F(x) = \int_0^x \frac{3\sqrt{y}}{2}\,dy = x^{3/2}$.

> g k k [1] 2.999999 f > >


51. Given the function f(x) = k, −1 < x < 1, of the random variable X, find the coefficient of skewness for the distribution.

Solution: First, find k:

$$\int_{-1}^{1} k\,dx = kx\Big|_{-1}^{1} = k(1 - (-1)) = 2k \overset{\text{set}}{=} 1 \implies k = 1/2.$$

Skewness is $\gamma_1 = \frac{E[(X - \mu)^3]}{\sigma^3}$, so the expected value and variance must be calculated, as must $E[X^3]$.

$$\mu = E[X] = \int_{-1}^{1} \frac{x}{2}\,dx = \left.\frac{x^2}{4}\right|_{-1}^{1} = 0$$

$$\sigma^2 = \text{Var}[X] = \int_{-1}^{1} \frac{1}{2}(x - 0)^2\,dx = \left.\frac{x^3}{6}\right|_{-1}^{1} = \frac{2}{6} = \frac{1}{3}$$

$$E[X^3] = \int_{-1}^{1} \frac{x^3}{2}\,dx = \left.\frac{x^4}{8}\right|_{-1}^{1} = 0$$

Since Var[X] > 0 and $E[(X - \mu)^3] = E[X^3] = 0$, the skewness is zero.

106 > > > > >

Probability and Statistics with R, Second Edition: Exercises and Solutions

f > > > >

SS > >

Y integrate(f, 5, 8)$value

# P(X integrate(f, 6, 10)$value

# P(X >=6)

[1] 0.96 > integrate(f, 7, 8)$value

# P(7 < X < 8)

[1] 0.2

56. The number of bottles of milk that a dairy farm fills per day is a random variable with mean 5000 and standard deviation 100. Assume the farm always has a sufficient number of glass bottles to be used to store the milk. However, for a bottle of milk to be sent to a grocery store, it must be hermetically sealed with a metal cap that is produced on site. Calculate the minimum number of metal caps that must be produced on a daily basis so that all filled milk bottles can be shipped to grocery stores with a probability of at least 0.9. Solution: Let X = number of milk bottles filled per day. X ∼ (µ = 5000, σ = 100).

By Chebyshev’s Inequality,

$$\begin{aligned}
P(|X - \mu| < k) &\ge 1 - \frac{\sigma^2}{k^2} \\
P(|X - 5000| < k) &\ge 1 - \frac{100^2}{k^2}
\end{aligned}$$

Since the probability requested is at least 0.9,

$$\begin{aligned}
1 - \frac{100^2}{k^2} &\ge 0.9 \\
0.1k^2 &\ge 100^2 \\
k^2 &\ge 100{,}000 \\
k &\ge \sqrt{100{,}000} = 316.2278 \implies k = 317
\end{aligned}$$

If 5317 metal caps are produced on a daily basis, all filled milk bottles can be shipped to grocery stores with a probability of at least 0.9.

57. Define X as the space occupied by a certain device in a 1 m³ container. The probability density function of X is given by

$$f(x) = \frac{630}{56}x^4\left(1 - x^4\right), \quad 0 < x < 1.$$

(a) Graph the probability density function.
(b) Calculate the mean of X by hand.
(c) Calculate the variance of X by hand.
(d) Calculate P(0.20 < X < 0.80) by hand.
(e) Calculate the mean of X using integrate().
(f) Calculate the variance of X using integrate().
(g) Calculate P(0.20 < X < 0.80) using integrate().

Solution: (a)
> f <- function(x){630/56 * x^4 * (1 - x^4)}
> curve(f, 0, 1)

[Figure: graph of f(x) against x for 0 < x < 1.]

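(b)

$$\begin{aligned}
E[X] &= \int_0^1 x \cdot \frac{630}{56}x^4\left(1 - x^4\right)dx = \frac{630}{56}\int_0^1 \left(x^5 - x^9\right)dx \\
&= \frac{630}{56}\left[\frac{x^6}{6} - \frac{x^{10}}{10}\right]_0^1 = \frac{630}{56}\left[\left(\frac{1}{6} - \frac{1}{10}\right) - 0\right] = \frac{630}{56}\cdot\frac{2}{30} = \frac{3}{4}
\end{aligned}$$

(c) First, find $E[X^2]$.

$$\begin{aligned}
E[X^2] &= \int_0^1 x^2 \cdot \frac{630}{56}x^4\left(1 - x^4\right)dx = \frac{630}{56}\int_0^1 \left(x^6 - x^{10}\right)dx \\
&= \frac{630}{56}\left[\frac{x^7}{7} - \frac{x^{11}}{11}\right]_0^1 = \frac{630}{56}\left[\left(\frac{1}{7} - \frac{1}{11}\right) - 0\right] = \frac{630}{56}\cdot\frac{4}{77} = \frac{45}{77} = 0.5844
\end{aligned}$$

$$\text{Var}[X] = E[X^2] - (E[X])^2 = \frac{45}{77} - \frac{9}{16} = \frac{27}{1232} = 0.0219$$

(d)

$$\begin{aligned}
P(0.2 < X < 0.8) &= \int_{0.2}^{0.8} \frac{630}{56}x^4\left(1 - x^4\right)dx = \frac{630}{56}\left[\frac{x^5}{5} - \frac{x^9}{9}\right]_{0.2}^{0.8} \\
&= \frac{630}{56}\left[\left(\frac{(0.8)^5}{5} - \frac{(0.8)^9}{9}\right) - \left(\frac{(0.2)^5}{5} - \frac{(0.2)^9}{9}\right)\right] = \frac{222{,}183}{390{,}625} = 0.5688
\end{aligned}$$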
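The hand calculations in (b) through (d) can be checked numerically. The sketch below is not the authors' transcript for parts (e) through (g), which was lost in extraction; the names `EX`, `EX2`, `VX`, and `P` are assumptions chosen for this illustration.

```r
# Density from Exercise 57
f <- function(x) 630/56 * x^4 * (1 - x^4)

# First and second moments by numerical integration over (0, 1)
EX  <- integrate(function(x) x * f(x),   lower = 0, upper = 1)$value
EX2 <- integrate(function(x) x^2 * f(x), lower = 0, upper = 1)$value

# Variance from the two moments, and the requested probability
VX <- EX2 - EX^2
P  <- integrate(f, lower = 0.2, upper = 0.8)$value

c(mean = EX, variance = VX, prob = P)
```

The three values should agree with 3/4, 27/1232, and 222,183/390,625 up to the tolerance of `integrate()`.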

(e) > fxe EX EX [1] 0.75 (f) > > > >

fxf 0 f (x) = 50 0 x ≤ 0. Using R,

(a) Find the probability that a car stays more than 1 hour.
(b) Let Y = 0.5 + 0.03X be the cost in dollars that the mall has to pay a security service per parked car. Find the mean parking cost for 1000 cars.
(c) Find the variance and skewness coefficient of Y.

Solution: (a) P(X > 60) = 0.3011942
> fx <- function(x){1/50 * exp(-x/50)}
> integrate(f = fx, lower = 60, upper = Inf)$value
[1] 0.3011942
(b) E[X] = 50, Y = 0.5 + 0.03X, E[Y] = 0.5 + 0.03E[X] = 2, and E[1000Y] = 1000E[Y] = 2000.

Me

VX LT LT [1] 0.8952425

63. The time, in hours, a child practices his musical instrument on Saturdays has pdf

$$f(x) = \begin{cases} k(1 - x) & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find k to make f(x) a valid pdf.
(b) Write the cdf and find the probability the child practices more than 48 minutes on a Saturday.

Solution: (a) To be a pdf, the integral of f(x) must equal 1.

$$1 = \int_0^1 k(1 - x)\,dx = \left.-\frac{k}{2}(1 - x)^2\right|_0^1 = \frac{k}{2}(1 - 0)^2 = \frac{k}{2}$$

The integral of f(x) is 1 when k = 2.

(b) Since

$$\int_0^x 2(1 - t)\,dt = -(1 - t)^2\Big|_0^x = 1 - (1 - x)^2 = x(2 - x),$$

it follows that

$$F(x) = \begin{cases} 0 & x < 0 \\ x(2 - x) & 0 \le x \le 1 \\ 1 & x > 1. \end{cases}$$

P(X > 48/60) = 1 − P(X ≤ 0.8) = 1 − F(0.8) = 1 − 0.8(2 − 0.8) = 0.04.

> f <- function(x){2 * (1 - x)}
> integrate(f, 0.8, 1)$value
[1] 0.04

64. Consider the equilateral triangle ABC with side l. Given a randomly chosen point R in the triangle, calculate the cumulative distribution and the probability density functions for the distance from R to the side BC. Construct a graph of the cumulative distribution function for different values of l. (Hint: The equation of the line CA is $y = \sqrt{3}x$.)

[Figure: equilateral triangle ABC with side l, interior point R at distance x from side CB, and points M and N marking the trapezoid CMNB.]

Solution: Let X = the distance from R to side BC. Then

$$P(X \le x) = \frac{\text{Area of } CMNB}{\text{Area of } ABC}$$

$$\text{Area of } CMNB = \frac{1}{2}\left[\text{length}(MN) + \text{length}(CB)\right] \cdot x = \frac{1}{2}\left[\left(l - \frac{2x}{\sqrt{3}}\right) + l\right] \cdot x = lx - \frac{x^2}{\sqrt{3}}$$

$$\text{Area of } ABC = \frac{1}{2}\,l \cdot \frac{\sqrt{3}}{2}\,l = \frac{l^2\sqrt{3}}{4}$$

So,

$$P(X \le x) = \frac{lx - \frac{x^2}{\sqrt{3}}}{\frac{l^2\sqrt{3}}{4}} = \frac{4}{l^2\sqrt{3}}\left(lx - \frac{x^2}{\sqrt{3}}\right) = \frac{4x(l\sqrt{3} - x)}{3l^2} = F_X(x)$$

Note $0 \le x \le \frac{l\sqrt{3}}{2}$.

$$f(x) = F_X'(x) = \frac{4l\sqrt{3} - 8x}{3l^2} \quad \text{when } 0 \le x \le \frac{l\sqrt{3}}{2}$$

[Figure: cumulative distribution function of the distance to CB for side lengths l = 1, 2, 3, 4, 5; x-axis: Distance to CB.]
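The plotting code for this exercise was lost in extraction. A minimal sketch of the cdf and pdf derived in the solution follows; the function names `Fx` and `fx` and their argument `l` are assumptions for this illustration, not the authors' code.

```r
# cdf of the distance from a random point to side CB (side length l)
Fx <- function(x, l) {
  h <- l * sqrt(3) / 2                      # height of the triangle
  ifelse(x < 0, 0,
         ifelse(x > h, 1, 4 * x * (l * sqrt(3) - x) / (3 * l^2)))
}

# pdf obtained by differentiating the cdf on [0, h]
fx <- function(x, l) {
  h <- l * sqrt(3) / 2
  ifelse(x >= 0 & x <= h, (4 * l * sqrt(3) - 8 * x) / (3 * l^2), 0)
}

# cdf values at a few distances for side lengths 1 through 5
sapply(1:5, function(l) Fx(c(0.25, 0.5, 1, 2), l))
```

A single curve can then be drawn with, for example, `curve(Fx(x, l = 1), 0, sqrt(3)/2)`, and further side lengths overlaid with `add = TRUE`.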


Chapter 4 Univariate Probability Distributions

1. Let X be a Poisson random variable with mean equal to 2. Find P(X = 0), P(X ≥ 3), and the smallest k such that P(X ≤ k) ≥ 0.70.

Solution: P(X = 0) = 0.1353, P(X ≥ 3) = 0.3233, and P(X ≤ k) ≥ 0.70 =⇒ k = 3.

> dpois(0, 2)
[1] 0.1353353
> ppois(2, 2, lower = FALSE)
[1] 0.3233236
> qpois(0.7, 2)
[1] 3

2. Let X be an exponential random variable Exp(λ = 3). Find P(2 < X < 6).

Solution: P(2 < X < 6) = 0.0025

> pexp(6, 3) - pexp(2, 3)
[1] 0.002478737

3. Let X be a normal random variable N(µ = 7, σ = 3). Calculate P(X > 7.1). Find the value of k such that P(X < k) = 0.8.

Solution: P(X > 7.1) = 0.4867, and k = 9.5249 if P(X < k) = 0.8.

> pnorm(7.1, 7, 3, lower = FALSE)
[1] 0.4867044
> qnorm(0.80, 7, 3)
[1] 9.524864

4. Let X be a normal random variable N(µ = 3, σ = √0.5). Calculate P(X > 3.5).

Solution: P(X > 3.5) = 0.2398

> pnorm(3.5, 3, sqrt(0.5), lower = FALSE)
[1] 0.2397501

5. Let X be a gamma random variable Γ(α = 2, λ = 6). Find the value a such that P(X < a) = 0.95.

Solution: a = 0.7906 if P(X < a) = 0.95

> qgamma(0.95, 2, 6)
[1] 0.7906441

6. Construct a plot for the probability mass function and the cumulative probability distribution of a binomial random variable Bin(n = 8, π = 0.3). Find the smallest value of k such that P(X ≤ k) ≥ 0.44 when X ∼ Bin(n = 8, π = 0.7). Calculate P(Y ≥ 3) if Y ∼ Bin(20, 0.2). Solution:

> > + + + > > + + +


DF1 sum(dbinom(42:54, 60, 0.80)) [1] 0.9658254

9. It is known that 3% of the seeds of a certain variety of tomato do not germinate. The seeds are sold in individual boxes that contain 20 seeds per box with the guarantee that at least 18 seeds will germinate. Find the probability that a randomly selected box does not fulfill the aforementioned requirement. Solution: Let X = number of seeds that germinate. X ∼ Bin(n = 20, π = 0.97), P(X ≤ 17) = 0.021. > pbinom(17, 20, 0.97) [1] 0.02100836

10. A garage has two machines, A and B, to balance the wheels of a car. Suppose that 95% of the wheels are correctly balanced by machine A, while 85% of the wheels are correctly balanced by machine B. A machine is randomly selected to balance 20 wheels, and 3 of them are not properly balanced. What is the probability that machine A was used? What is the probability machine B was used?

Solution: Let A = machine A was used to balance the wheels and B = machine B was used to balance the wheels. Let E = the event that 3 out of 20 wheels are balanced improperly. If X = the number of wheels balanced improperly, then P(E|A) = P(X = 3) = 0.0596 where X ∼ Bin(20, 0.05), and P(E|B) = P(X = 3) = 0.2428 where X ∼ Bin(20, 0.15).

$$\begin{aligned}
P(A|E) &= \frac{P(A \cap E)}{P(E)} = \frac{P(E|A) \cdot P(A)}{P(E|A) \cdot P(A) + P(E|B) \cdot P(B)} = \frac{0.0596 \times 0.5}{0.0596 \times 0.5 + 0.2428 \times 0.5} = 0.197 \\
P(B|E) &= \frac{P(B \cap E)}{P(E)} = \frac{P(E|B) \cdot P(B)}{P(E|A) \cdot P(A) + P(E|B) \cdot P(B)} = \frac{0.2428 \times 0.5}{0.0596 \times 0.5 + 0.2428 \times 0.5} = 0.803
\end{aligned}$$
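The Bayes' rule calculation for Exercise 10 can be sketched in R as follows. This is an illustrative check rather than part of the original solution, and the object names `like`, `prior`, and `posterior` are assumptions.

```r
# Likelihood of 3 improperly balanced wheels out of 20 under each machine
like <- c(A = dbinom(3, 20, 0.05),
          B = dbinom(3, 20, 0.15))

# Each machine is equally likely to be selected
prior <- c(A = 0.5, B = 0.5)

# Bayes' rule: posterior is proportional to likelihood times prior
posterior <- like * prior / sum(like * prior)
round(posterior, 3)
```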


11. Traffic volume is an important factor for determining the most cost-effective method to surface a road. Suppose that the average number of vehicles passing a certain point on a road is 2 every 30 seconds.
(a) Find the probability that more than 3 cars will pass the point in 30 seconds.
(b) What is the probability that more than 10 cars pass the point in 3 minutes?

Solution: (a) Let X = number of cars passing a certain point on the road in 30 seconds. X ∼ Pois(λ = 2). P(X > 3) = 1 − P(X ≤ 3) = 0.1429
> 1 - ppois(3, 2)
[1] 0.1428765
(b) Let Y = number of cars passing a certain point on the road in 3 minutes. Y ∼ Pois(λ = 12). P(Y > 10) = 1 − P(Y ≤ 10) = 0.6528
> 1 - ppois(10, 12)
[1] 0.6527706

12. The retaining wall of a dam will break if it is subjected to the pressure of two floods. If the average number of floods in a century is two, find the probability that the retaining wall lasts more than 20 years. Solution: Let X = number of floods a retaining wall is subjected to each year. X ∼ Pois(λ = 2/100). Let W = waiting time until the αth flood. The probability the retaining wall lasts more than 20 years is P(W > 20) = 0.9384 where W ∼ Γ(2, 2/100). > pgamma(20, 2, 2/100, lower = FALSE) [1] 0.9384481
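As a cross-check of Exercise 12, the event that the wall lasts more than 20 years is the event that at most one flood occurs in 20 years, so the gamma tail probability should match a Poisson cdf. The sketch below illustrates this equivalence; the names `rate`, `years`, `p_gamma`, and `p_pois` are assumptions introduced here.

```r
rate  <- 2 / 100   # floods per year
years <- 20

# Waiting time to the 2nd flood exceeds 20 years
p_gamma <- pgamma(years, shape = 2, rate = rate, lower.tail = FALSE)

# Equivalent statement: at most 1 flood in 20 years
p_pois <- ppois(1, lambda = rate * years)

c(gamma = p_gamma, poisson = p_pois)
```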

13. A particular competition shooter hits his targets 70% of the time with any pistol. To prepare for shooting competitions, this individual practices with a pistol that holds 5 bullets on Tuesday, Thursday, and Saturday, and a pistol that holds 7 bullets the other days. If he fires at targets until the pistol is empty, find the probability that he hits only one target out of the bullets shot in the first round of bullets in the pistol he is carrying that day. In this case, what is the probability that he used the pistol with 7 bullets?

Solution: If he fires at targets until the pistol is empty, the probability that he hits only one target out of the bullets shot in the first round of bullets in the pistol he is carrying that day is 0.0142. Let X = number of shots that hit the target. X|TRS ∼ Bin(5, 0.7) and X|MWFS ∼ Bin(7, 0.7).

$$\begin{aligned}
P(X = 1) &= P(X = 1|TRS)P(TRS) + P(X = 1|MWFS)P(MWFS) \\
&= \binom{5}{1}(0.7)^1(0.3)^4 \cdot \frac{3}{7} + \binom{7}{1}(0.7)^1(0.3)^6 \cdot \frac{4}{7} \\
&= 0.0142
\end{aligned}$$

> AnsA <- dbinom(1, 5, 0.7)*3/7 + dbinom(1, 7, 0.7)*4/7
> AnsA
[1] 0.0141912

The probability that he used the pistol with 7 bullets if he hits only one target is 0.1438.

$$P(MWFS|X = 1) = \frac{P(X = 1|MWFS)P(MWFS)}{P(X = 1)} = \frac{\binom{7}{1}(0.7)^1(0.3)^6 \cdot \frac{4}{7}}{0.0142} = 0.1438$$

> AnsB <- dbinom(1, 7, 0.7)*(4/7)/AnsA
> AnsB
[1] 0.1438356

14. The lifetime of a certain engine follows a normal distribution with mean and standard deviation of 10 and 3.5 years, respectively. The manufacturer replaces all catastrophic engine failures within the guarantee period free of charge. If the manufacturer is willing to replace no more than 4% of the defective engines, what is the largest guarantee period the manufacturer should advertise? Solution: Let EF ∼ N (10, 3.5), Given P(EF ≤ k) = 0.04 =⇒ k = 3.8726 years is the largest guarantee period the manufacturer should advertise. > qnorm(0.04, 10, 3.5) [1] 3.872599

15. Agronomists are developing an improved variety of green peppers. Supermarket managers have indicated customers are not likely to purchase green peppers weighing less


than 45 grams. The current variety of green pepper plants produces green peppers that weigh 48 grams on average, but 13% weigh less than 45 grams. Assume the weight of the current variety of green peppers follows a normal distribution.
(a) What is the standard deviation of the weights of the current variety of green peppers?
(b) The agronomists want to reduce the frequency of green peppers weighing less than 45 grams to no more than 5%. One way to reduce the frequency of underweight green peppers is to increase the weight of the green peppers. If the standard deviation remains the same, what mean should the agronomists target as a goal?
(c) The agronomists produce a new variety of green peppers with a mean weight of 50 grams, which meets the 5% goal. What is the standard deviation of the weights of these new green peppers?
(d) Does the current variety or the new variety produce a green pepper with a more consistent weight?

Solution: (a)
> z <- qnorm(0.13)
> z
[1] -1.126391
> sig <- (45 - 48)/z
> sig
[1] 2.663373

$$\begin{aligned}
z_{0.13} &= \frac{x - \mu}{\sigma} \\
-1.1264 &= \frac{45 - 48}{\sigma} \\
-1.1264 \cdot \sigma &= -3 \\
\sigma &= \frac{-3}{-1.1264} = 2.6634
\end{aligned}$$

The standard deviation for the current variety of green peppers is 2.6634 grams.

(b)
> z <- qnorm(0.05)
> z
[1] -1.644854
> mu <- 45 - z*sig
> mu
[1] 49.38086

$$\begin{aligned}
z_{0.05} &= \frac{x - \mu}{\sigma} \\
-1.6449 &= \frac{45 - \mu}{2.6634} \\
-1.6449 \cdot 2.6634 = -4.3809 &= 45 - \mu \\
\mu &= 45 - (-4.3809) = 49.3809
\end{aligned}$$

The agronomists should attempt to create green peppers with a mean weight of 49.3809 grams.

(c)
> z <- qnorm(0.05)
> sig2 <- (45 - 50)/z
> sig2
[1] 3.039784

$$\begin{aligned}
z_{0.05} &= \frac{x - \mu}{\sigma} \\
-1.6449 &= \frac{45 - 50}{\sigma} \\
\sigma &= \frac{45 - 50}{-1.6449} = 3.0398
\end{aligned}$$

The standard deviation for the new variety of green peppers is 3.0398 grams.

(d) Since the standard deviation of the new variety (3.0398 grams) is greater than the standard deviation of the current variety (2.6634 grams), the new variety is less consistent with respect to weight than the current variety.

16. Given independent random variables Y1, Y2, X, W, Z1, Z2, and Z3,
(a) Compute P((Y2 ≥ 3) ∪ (Y1 < 9)) if Y1 ∼ Bin(n = 10, π = 0.3) and Y2 ∼ Bin(n = 5, π = 0.1).
(b) Compute P(X ≥ 2|X < 6) if X ∼ Pois(λ = 4).
(c) If W ∼ N(µ, σ), find the value of k that satisfies the equation P(µ < W < µ + 2kσ) = 0.45.
(d) If Zi ∼ N(0, 1) for i = 1, 2, 3, compute $P\left(\sqrt{Z_1^2 + Z_2^2 + Z_3^2} > 1.5\right)$.

Solution: (a)


P1

$$P\left(\sqrt{Z_1^2 + Z_2^2 + Z_3^2} > 1.5\right) = P\left(\chi^2_3 > (1.5)^2\right) = 0.5222$$
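Most of the R transcript for Exercise 16 was lost in extraction. The following sketch shows one way the four parts could be computed; it is not the authors' code, and the names `pa`, `pb`, `k`, and `pd` are assumptions made for this illustration.

```r
# (a) Union via the complement and independence:
#     P(A or B) = 1 - P(Y2 <= 2) * P(Y1 >= 9)
pa <- 1 - pbinom(2, 5, 0.1) * pbinom(8, 10, 0.3, lower.tail = FALSE)

# (b) Conditional probability: P(2 <= X <= 5) / P(X <= 5)
pb <- (ppois(5, 4) - ppois(1, 4)) / ppois(5, 4)

# (c) P(0 < Z < 2k) = 0.45 means pnorm(2k) = 0.95
k <- qnorm(0.5 + 0.45) / 2

# (d) The sum of three squared standard normals is chi-square with 3 df
pd <- pchisq(1.5^2, df = 3, lower.tail = FALSE)

c(a = pa, b = pb, k = k, d = pd)
```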


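17. Derive the mean and variance for the discrete uniform distribution. (Hints: $\sum_{i=1}^{n} x_i = \frac{n(n+1)}{2}$; $\sum_{i=1}^{n} x_i^2 = \frac{n(n+1)(2n+1)}{6}$, when $x_i = 1, 2, \ldots, n$.)

Solution:

X is a discrete uniform, which means it takes on values in {1, 2, 3, . . . , n} with probability 1/n each.

$$E[X] = \sum_{i=1}^{n} x_i \cdot P(X = x_i) = \sum_{i=1}^{n} i \cdot \frac{1}{n} = \frac{1}{n}\sum_{i=1}^{n} i = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2}$$

$$E[X^2] = \sum_{i=1}^{n} x_i^2 \cdot P(X = x_i) = \sum_{i=1}^{n} i^2 \cdot \frac{1}{n} = \frac{1}{n}\sum_{i=1}^{n} i^2 = \frac{1}{n} \cdot \frac{n(n+1)(2n+1)}{6} = \frac{(n+1)(2n+1)}{6}$$

$$\begin{aligned}
\text{Var}[X] = E[X^2] - (E[X])^2 &= \frac{(n+1)(2n+1)}{6} - \left(\frac{n+1}{2}\right)^2 \\
&= \frac{2n^2 + 3n + 1}{6} - \frac{n^2 + 2n + 1}{4} \\
&= \frac{4n^2 + 6n + 2 - 3n^2 - 6n - 3}{12} \\
&= \frac{n^2 - 1}{12}
\end{aligned}$$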
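The closed forms for the discrete uniform distribution can be spot-checked for a specific n. The sketch below uses n = 10 as an arbitrary illustrative choice; the object names are assumptions.

```r
n <- 10
x <- 1:n

# Mean and variance computed directly from the pmf P(X = x) = 1/n
EX <- sum(x) / n
VX <- sum(x^2) / n - EX^2

# Compare against the derived formulas (n + 1)/2 and (n^2 - 1)/12
c(EX = EX, formula_mean = (n + 1) / 2,
  VX = VX, formula_var = (n^2 - 1) / 12)
```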

18. Suppose the percentages of drinks sold from a vending machine are 80% and 20% for soft drinks and bottled water, respectively.
(a) What is the probability that on a randomly selected day, the first soft drink is the fourth drink sold?
(b) Find the probability that exactly 1 out of 10 drinks sold is a soft drink.
(c) Find the probability that the fifth soft drink is the seventh drink sold.
(d) Verify empirically that P(Bin(n, π) ≤ r − 1) = 1 − P(NB(r, π) ≤ (n − r)), with n = 10, π = 0.8, and r = 4.

Solution: Let X = number of waters (failures) purchased before the first soft drink is purchased. Then, X ∼ Geo(0.80).
(a) P(X = 3) = 0.0064 or use X ∼ NB(1, 0.80).
> dgeom(3, 0.80)
[1] 0.0064
> dnbinom(3, 1, 0.80)
[1] 0.0064
(b) Let X = number of soft drinks sold. Then, X ∼ Bin(10, 0.80) and P(X = 1) ≈ 0 (to four decimal places).
> dbinom(1, 10, 0.80)
[1] 4.096e-06
(c) Let X = number of waters purchased before the fifth soft drink is purchased. Then, X ∼ NB(5, 0.80) and P(X = 2) = 0.1966.
> dnbinom(2, 5, 0.80)
[1] 0.196608
(d)
> A <- pbinom(4 - 1, 10, 0.80)
> B <- 1 - pnbinom(10 - 4, 4, 0.80)
> c(A, B)
[1] 0.0008643584 0.0008643584

19. The hardness of a particular type of sheet metal sold by a local manufacturer has a normal distribution with a mean of 60 micra and a standard deviation of 2 micra.
(a) This type of sheet metal is said to conform to specification provided its hardness measure is between 57 and 65 micra. What percent of the manufacturer’s sheet metal can be expected to fall within the specification?
(b) A building contractor agrees to purchase metal from the local metal manufacturer at a premium price provided four out of four randomly selected pieces of metal test between 57 and 65 micra. What is the probability the building contractor will purchase metal from the local manufacturer and pay a premium price?
(c) If an acceptable sheet of metal is one whose hardness is not more than c units away from the mean, find c such that 97% of the sheets are acceptable.
(d) Find the probability that at least 10 out of 20 sheets have a hardness greater than 60.

Solution: (a) Given X ∼ N(60, 2), P(57 < X < 65) = 0.927. The percent of the manufacturer’s sheet metal that can be expected to fall within the specification is 92.6983%.
> p <- pnorm(65, 60, 2) - pnorm(57, 60, 2)
> p
[1] 0.9269831
(b) Let Y = number of sheets that test between 57 and 65 micra. Then, Y ∼ Bin(4, 0.927) and P(Y = 4) = 0.7384, so there is a 0.7384 chance the contractor will purchase from the local manufacturer and pay a premium price.
> dbinom(4, 4, p)
[1] 0.7383926
(c) P(X < k) = 0.985 =⇒ k = 64.3402 =⇒ c = 4.3402
> k <- qnorm(0.985, 60, 2)
> INC <- k - 60
> c(k, INC)
[1] 64.340181  4.340181

(d) Let W = number of sheets that have hardness greater than 60. Then, W ∼ Bin(20, 0.5), and P(W ≥ 10) = 1 − P(W ≤ 9) = 0.5881. > 1 - pbinom(9, 20, 0.5) [1] 0.5880985

20. The weekly production of a banana plantation can be modeled with a normal random variable that has a mean of 5 tons and a standard deviation of 2 tons.
(a) Find the probability that, in at most 1 out of 8 randomly chosen weeks, the production has been less than 3 tons.
(b) Find the probability that at least 3 weeks are needed to obtain a production greater than 10 tons.

Solution: (a) Let X = number of weeks where production is less than 3 tons. Then, X ∼ Bin(8, 0.1587). Let W = production, W ∼ N(5, 2) and P(W ≤ 3) = 0.1587. So, P(X ≤ 1) = 0.6298.
> p <- pnorm(3, 5, 2)   # P(W < 3)
> pbinom(1, 8, p)
[1] 0.6298268
(b) Let Xn = number of weeks where production is less than 10 tons before the first week with production over 10 tons. Xn ∼ Geo(π = P(W ≥ 10) = 0.0062), and P(Xn ≥ 2) = 1 − P(Xn ≤ 1) = 0.9876.
> PI <- pnorm(10, 5, 2, lower = FALSE)   # P(W >= 10)
> PI
[1] 0.006209665
> 1 - pgeom(1, PI)
[1] 0.9876192

21. A bank has 50 deposit accounts with €25,000 each. The probability of having to close a deposit account and then refund the money in a given day is 0.01. If account closings are independent events, how much money must the bank have available to guarantee it can refund all closed accounts in a given day with probability greater than 0.95?

Solution: Let X = number of accounts closed and refunded out of 50. Then, X ∼ Bin(50, 0.01). The minimum number of accounts closed where the probability is at least 0.95 occurs where P(X ≤ k) ≥ 0.95 =⇒ k = 2. Therefore, the bank must have on hand 2 × 25,000 = €50,000.
> k <- qbinom(0.95, 50, 0.01)
> k
[1] 2
> euros <- k*25000
> euros
[1] 50000

22. The mean number of calls a tow truck company receives during a day is 5 per hour. Find the probability that a tow truck is requested more than 4 times per hour in a given hour. What is the probability the company waits for less than 1 hour before the tow truck is requested 3 times? Solution: Let X = number of times per hour the tow truck is requested. X ∼ Pois(λ = 5) and P(X > 4) = 1 − P(X ≤ 4) = 0.5595. > 1 - ppois(4, 5) [1] 0.5595067 Let W = the waiting time before the tow truck is requested 3 times. W ∼ Γ(3, 5) and P(W < 1) = 0.8753. The probability less than 1 hour goes by before the tow truck is requested 3 times is 0.8753. > pgamma(1, 3, 5) [1] 0.875348

23. In the printing section of a plastics company, a machine receives on average 6 buckets per minute to be painted and paints them. The machine has been out of service for 90 seconds due to a power failure. (a) Find the probability that more than 8 buckets remain unpainted. (b) Find the probability that the first bucket, after the electricity is restored, arrives before 10 seconds have passed. Solution: (a) Let X = number of buckets received per minute. Then, X ∼ Pois(λ = 6). If 6 buckets arrive in 60 seconds, 9 buckets are expected to arrive in 90 seconds. Let Y = number of buckets that arrive in 90 seconds. It follows then that Y ∼ Pois(λ = 9). Assume an employee is hand transporting the buckets at the same rate given in the problem. P(Y > 8) = 1 − P(Y ≤ 8) = 0.5443


> 1 - ppois(8, 9) [1] 0.5443474 (b) Let T = time until the first bucket arrives. T ∼ Γ(1, 6) and P(T < 1/6) = 0.6321. > pgamma(1/6, 1, 6) [1] 0.6321206

24. Give a general expression to calculate the quantiles of a Weibull random variable.

Solution: The j-th quantile (0 ≤ j ≤ 1) is the value $x_j$ such that

$$F(x_j) = \int_{-\infty}^{x_j} f(x)\,dx = j.$$

For a Weibull,

$$f(x) = \begin{cases} \alpha\beta^{-\alpha}x^{\alpha-1}e^{-(x/\beta)^{\alpha}} & x \ge 0 \\ 0 & x < 0 \end{cases}$$
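The remainder of this solution did not survive extraction. As a hedged sketch: the Weibull cdf for x ≥ 0 is F(x) = 1 − exp(−(x/β)^α), so solving F(x_j) = j gives x_j = β(−ln(1 − j))^(1/α). The code below compares that expression with R's built-in `qweibull()`; the function name `weib_quantile` and the parameter values α = 3, β = 25 are assumptions chosen for illustration.

```r
# Closed-form Weibull quantile obtained by inverting the cdf
weib_quantile <- function(j, alpha, beta) {
  beta * (-log(1 - j))^(1 / alpha)
}

j <- c(0.1, 0.5, 0.9)

# Side-by-side comparison with the built-in quantile function
cbind(closed_form = weib_quantile(j, alpha = 3, beta = 25),
      qweibull    = qweibull(j, shape = 3, scale = 25))
```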

> pweibull(12, 3, 25)
[1] 0.104696

Since there is a 10.4696% chance of break down in the first 12 months and there are 50 cars, we can expect 5.2348 cars to break down during the guarantee period. Consequently, the expected price of the guarantee is $4187.8417. > EBD EBD

# E(Break Downs)

[1] 5.234802 > EGP EGP

# E(Price of Guarantee)

[1] 4187.842 > + + + + + + > > > + + + +

limitRange >

set.seed(500) rs > > + + > > >

set.seed(50) rs > > >

f1
> abs(var(Volume) - VV)/VV*100
[1] 0.2798443

The simulated expected value of the volume is 0.3033% from the theoretical expected value of the volume, and the simulated variance of the volume is 0.2798% from the theoretical variance of the volume.

29. Let X be a random variable with probability density function

$$f(x) = 3\left(\frac{1}{x}\right)^4, \quad x \ge 1.$$

(a) Find the cumulative distribution function.
(b) Fix the seed at 98 (set.seed(98)), and generate a random sample of size n = 100,000 from X’s distribution. Compute the mean, variance, and coefficient of skewness for the random sample.
(c) Obtain the theoretical mean, variance, and coefficient of skewness of X.
(d) How close are the estimates in (b) to the theoretical values in (c)?

Solution: (a)

$$F_X(x) = \int_1^x 3\left(\frac{1}{t}\right)^4 dt = -t^{-3}\Big|_1^x = -x^{-3} + 1$$

(b) To do the generation, find the relationship between a uniform and X.

$$\begin{aligned}
F_X(x) = -x^{-3} + 1 &\overset{\text{set}}{=} u \\
-x^{-3} &= u - 1 \\
x^{-3} &= 1 - u \\
x &= (1 - u)^{-1/3} \\
x &= 1/(1 - u)^{1/3}
\end{aligned}$$
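The simulation code for part (b) was lost in extraction. A minimal inverse-transform sketch under the relationship just derived follows; apart from the seed and sample size stated in the exercise, the object names are assumptions. From the density, E[X] = 3/2 and E[X²] = 3, so the theoretical variance is 3/4.

```r
set.seed(98)
n <- 100000

# Inverse transform: if U ~ Unif(0, 1), then X = 1/(1 - U)^(1/3) has cdf 1 - x^(-3)
u <- runif(n)
x <- 1 / (1 - u)^(1/3)

# Sample mean and variance, to compare with 3/2 and 3/4
c(mean = mean(x), variance = var(x))

# Empirical check of the cdf at x = 2, where F(2) = 1 - 1/8
mean(x <= 2)
```

The sample variance converges slowly here because the fourth moment of X is infinite, so some run-to-run variability in that estimate is expected.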

set.seed(98) n 2. (d) To do the generation, find the relationship between a uniform and X. set

$$\begin{aligned}
1 - x^{-\theta} &= u \\
x^{-\theta} &= 1 - u \\
x &= (1 - u)^{-1/\theta} = \frac{1}{(1 - u)^{1/\theta}}
\end{aligned}$$

set.seed(42) n .75)? (d) Fix the seed at 13 (set.seed(13)), and generate 100,000 realizations of X. What are the mean and variance of the random sample? (e) Calculate the theoretical mean and variance of X. (f) How close are the estimates in (d) to the theoretical values in (e)? Solution: (a)

$$\int_0^1 f(x)\,dx = \int_0^1 \frac{4}{3}x(2 - x^2)\,dx = \int_0^1 \left(\frac{8x}{3} - \frac{4x^3}{3}\right)dx = \left[\frac{4x^2}{3} - \frac{x^4}{3}\right]_0^1 = \left(\frac{4}{3} - \frac{1}{3}\right) - (0 - 0) = 1$$

> f <- function(x){4/3 * x * (2 - x^2)}
> integrate(f, 0, 1)$value
[1] 1

(b)

$$F_X(x) = \int_0^x \left(\frac{8t}{3} - \frac{4t^3}{3}\right)dt = \left[\frac{4t^2}{3} - \frac{t^4}{3}\right]_0^x = \frac{4x^2}{3} - \frac{x^4}{3}, \quad 0 \le x \le 1$$

(c) P(X > 0.75) = 1 − F(0.75) = 0.3555
> 1 - (4*0.75^2/3 - 0.75^4/3)
[1] 0.3554688
> # or
> integrate(f, 0.75, 1)$value
[1] 0.3554688

(d) Set the cdf equal to u from a uniform.

$$\begin{aligned}
u &\overset{\text{set}}{=} F_X(x) \\
u &= \frac{4x^2}{3} - \frac{x^4}{3} \\
x^4 - 4x^2 + 3u &= 0
\end{aligned}$$

Let $y = x^2$.

$$y^2 - 4y + 3u = 0 \implies y = \frac{4 \pm \sqrt{16 - 4(1)(3u)}}{2} = 2 \pm \sqrt{4 - 3u}$$

Since x must fall between 0 and 1, only $2 - \sqrt{4 - 3u}$ is a viable solution for y. This means

$$x = \sqrt{2 - \sqrt{4 - 3u}}$$
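The simulation transcript for part (d) was lost in extraction. The sketch below applies the inverse transform derived above; apart from the seed and sample size given in the exercise, the names are assumptions. From the density, E[X] = 28/45 and E[X²] = 4/9.

```r
set.seed(13)
n <- 100000

# Inverse transform: X = sqrt(2 - sqrt(4 - 3U)) for U ~ Unif(0, 1)
u <- runif(n)
x <- sqrt(2 - sqrt(4 - 3 * u))

# Theoretical moments computed from the density (4/3) x (2 - x^2)
EX <- 28 / 45
VX <- 4 / 9 - EX^2

# Sample values next to the theoretical ones
c(sample_mean = mean(x), theory_mean = EX,
  sample_var  = var(x),  theory_var  = VX)
```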

set.seed(13) n + + >

# or f > > > > >

n 1|θ = 5) = 1 − P (X ≤ 1|θ = 5) = 0 > Fx 1 - Fx(1) [1] 1.507017e-07 > > + + >

# Or f > > > > > >

set.seed(201) n

$$\begin{aligned}
P\left(\frac{X - 0.11}{\sigma} > \frac{0.13 - 0.11}{\sigma}\right) &= 0.1 \\
P\left(Z > \frac{0.13 - 0.11}{\sigma}\right) &= 0.1 \implies z_{0.9} = \frac{0.13 - 0.11}{\sigma} \\
1.2816\sigma &= 0.02 \\
\sigma &= 0.0156
\end{aligned}$$

∴ X ∼ N(0.11, 0.0156)
> mu <- 0.11
> sigma <- (0.13 - 0.11)/qnorm(0.9)
> c(mu, sigma)
[1] 0.11000000 0.01560608
(b) P(0.10 < X < 0.13) = 0.6392
> Area <- pnorm(0.13, mu, sigma) - pnorm(0.10, mu, sigma)
> Area
[1] 0.6391658
(c) Let Y = number of usable conductor cables. Y ∼ Bin(n = 5, π = 0.6392). P(Y ≥ 3) = 1 − P(Y ≤ 2) = 0.7478
> 1 - pbinom(2, 5, Area)
[1] 0.7477729

35. A binomial, Bin(n, π), distribution can be approximated by a normal distribution, $N\left(n\pi, \sqrt{n\pi(1 - \pi)}\right)$, when nπ > 10 and n(1 − π) > 10. The Poisson distribution can also be approximated by a normal distribution $N\left(\lambda, \sqrt{\lambda}\right)$ if λ > 10. Consider a sequence from 5 to 27 of a variable X (binomial or Poisson) and show that for n = 80, π = 0.2, and λ = 16 the aforementioned approximations are appropriate. The normal approximation to a discrete distribution can occasionally be improved by adding 0.5 to the normal random variable when finding the area to the left of said random variable. Specifically, create a table showing P(X ≤ x) for the range of X for the four distributions and a graph showing the density of the normal distributions with vertical lines showing P(X = x) for the binomial and Poisson distributions, respectively.

Solution:
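The original solution code for this exercise did not survive extraction. The following is a minimal sketch of the requested table of P(X ≤ x), using the 0.5 continuity adjustment described in the exercise; the data frame `tab` and its column names are assumptions for this illustration.

```r
n <- 80; p <- 0.2; lambda <- 16
x <- 5:27

tab <- data.frame(
  x           = x,
  binomial    = pbinom(x, n, p),
  normal_bin  = pnorm(x + 0.5, mean = n * p, sd = sqrt(n * p * (1 - p))),
  poisson     = ppois(x, lambda),
  normal_pois = pnorm(x + 0.5, mean = lambda, sd = sqrt(lambda))
)

round(tab, 4)

# Largest absolute discrepancy of each normal approximation
c(binomial = max(abs(tab$binomial - tab$normal_bin)),
  poisson  = max(abs(tab$poisson - tab$normal_pois)))
```

Small maximum discrepancies in the last line are what would support calling the approximations appropriate for these parameter values.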

n

x
> choose(4, 3)*choose(4, 1)/choose(8, 4)
[1] 0.2285714

38. Consider the function $g(x) = (x - a)^2$, where a is a constant and $E\left[(X - a)^2\right]$ is finite. Find a so that $E\left[(X - a)^2\right]$ is minimized.

Solution:

$h(a) = E[(X - a)^2] = E[X^2] - 2aE[X] + a^2$.

Then $h'(a) = -2E[X] + 2a \overset{\text{set}}{=} 0 \implies a = E[X]$. Since $h''(a) = 2 > 0$, a = E[X] minimizes h(a).

39. Consider the random variable X ∼ Weib(α, β).
(a) Find the cdf for X.
(b) Use the definition of the hazard function to verify that for X ∼ Weib(α, β), the hazard function is given by $h(t) = \frac{\alpha t^{\alpha-1}}{\beta^{\alpha}}$.

Solution:

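(a) For X ∼ Weib(α, β), $f(x) = \alpha\beta^{-\alpha}x^{\alpha-1}e^{-(x/\beta)^{\alpha}}$ if x ≥ 0, and f(x) = 0 if x < 0.

$$F_X(x) = \int_0^x \alpha\beta^{-\alpha}t^{\alpha-1}e^{-(t/\beta)^{\alpha}}\,dt = -e^{-(t/\beta)^{\alpha}}\Big|_0^x = 1 - e^{-(x/\beta)^{\alpha}} \quad \text{for } x \ge 0$$

(b)

$$\begin{aligned}
h(t) = \frac{f(t)}{1 - F(t)} &= \frac{\alpha\beta^{-\alpha}t^{\alpha-1}e^{-(t/\beta)^{\alpha}}}{1 - \left(1 - e^{-(t/\beta)^{\alpha}}\right)} \\
&= \frac{\alpha\beta^{-\alpha}t^{\alpha-1}e^{-(t/\beta)^{\alpha}}}{e^{-(t/\beta)^{\alpha}}} \\
&= \alpha\beta^{-\alpha}t^{\alpha-1} = \frac{\alpha t^{\alpha-1}}{\beta^{\alpha}}
\end{aligned}$$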
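The hazard identity in Exercise 39 can be checked numerically with R's Weibull functions, where `shape` corresponds to α and `scale` to β. The parameter values and object names below are illustrative assumptions.

```r
alpha <- 3
beta  <- 25
t     <- c(1, 5, 12, 30)

# Hazard from its definition: f(t) / (1 - F(t))
h_def <- dweibull(t, shape = alpha, scale = beta) /
  pweibull(t, shape = alpha, scale = beta, lower.tail = FALSE)

# Closed form derived in part (b)
h_closed <- alpha * t^(alpha - 1) / beta^alpha

cbind(t, h_def, h_closed)
```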


40. If X ∼ Bin(n, π), use the binomial expansion to find the mean and variance of X. To find the variance, use the second factorial moment E[X(X − 1)] and note that $\frac{x}{x!} = \frac{1}{(x-1)!}$ when x > 0.

Solution: X ∼ Bin(n, π) =⇒ $P(X = x) = \binom{n}{x}\pi^x(1 - \pi)^{n-x}$.

Mean:

$$\begin{aligned}
E[X] &= \sum_{x=0}^{n} x \cdot \binom{n}{x}\pi^x(1 - \pi)^{n-x} = \sum_{x=0}^{n} \frac{x \cdot n!}{x!(n - x)!}\pi^x(1 - \pi)^{n-x}
\end{aligned}$$

Note that the first term in the sum (x = 0) is zero:

$$= \sum_{x=1}^{n} \frac{n!}{(x - 1)!(n - x)!}\pi^x(1 - \pi)^{n-x}$$

Let k = x − 1.

$$\begin{aligned}
&= \sum_{k=0}^{n-1} \frac{n!}{k!(n - k - 1)!}\pi^{k+1}(1 - \pi)^{n-k-1} \\
&= n\pi \underbrace{\sum_{k=0}^{n-1} \frac{(n - 1)!}{k!(n - 1 - k)!}\pi^k(1 - \pi)^{n-1-k}}_{\text{Sum of all Bin}(n-1,\,\pi)\text{ probabilities} \,=\, 1}
\end{aligned}$$

∴ E[X] = nπ

Finding E[X(X − 1)]:

$$E[X(X - 1)] = \sum_{x=0}^{n} x(x - 1) \cdot \frac{n!}{x!(n - x)!}\pi^x(1 - \pi)^{n-x}$$

Note that the first two terms in the sum (x = 0, 1) are zero:

$$= \sum_{x=2}^{n} \frac{n!}{(x - 2)!(n - x)!}\pi^x(1 - \pi)^{n-x}$$

Let k = x − 2, so x = k + 2.

$$\begin{aligned}
&= \sum_{k=0}^{n-2} \frac{n!}{k!(n - 2 - k)!}\pi^{k+2}(1 - \pi)^{n-2-k} \\
&= \pi^2 n(n - 1) \underbrace{\sum_{k=0}^{n-2} \frac{(n - 2)!}{k!(n - 2 - k)!}\pi^k(1 - \pi)^{n-2-k}}_{\text{Sum of all Bin}(n-2,\,\pi)\text{ probabilities} \,=\, 1}
\end{aligned}$$

∴ E[X(X − 1)] = π²n(n − 1)

The second factorial moment is $E[X(X - 1)] = E[X^2] - E[X]$, so

$$\begin{aligned}
\text{Var}[X] &= E[X(X - 1)] + E[X] - (E[X])^2 \\
&= \pi^2 n(n - 1) + n\pi - (n\pi)^2 \\
&= n^2\pi^2 - n\pi^2 + n\pi - n^2\pi^2 \\
&= n\pi(1 - \pi)
\end{aligned}$$
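The moment formulas derived for the binomial can be verified by summing over the pmf directly. The sketch below uses n = 8 and π = 0.3 as arbitrary illustrative values; the object names are assumptions.

```r
n <- 8
p <- 0.3
x <- 0:n
px <- dbinom(x, n, p)

# Mean and second factorial moment straight from the pmf
EX   <- sum(x * px)
EXX1 <- sum(x * (x - 1) * px)

# Variance through the factorial-moment identity
VX <- EXX1 + EX - EX^2

c(EX = EX, n_pi = n * p,
  EXX1 = EXX1, formula = n * (n - 1) * p^2,
  VX = VX, n_pi_1_pi = n * p * (1 - p))
```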

41. The speed of a randomly chosen gas molecule in a certain volume of gas is a random variable, V, with probability density function

$$f(v) = \sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2} v^2 e^{-\frac{Mv^2}{2RT}} \quad \text{for } v \ge 0$$

where R is the gas constant (= 8.3145 J/(mol · K)), M is the molecular weight of the gas, and T is the absolute temperature measured in kelvins. (Hints: $\int_0^\infty x^k e^{-x^2}\,dx = \frac{1}{2}\Gamma\left(\frac{k+1}{2}\right)$, $\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)$, and $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$.)

(a) Derive a general expression for the average speed of a gas molecule.
(b) If 1 J = 1 kg · m²/s², what are the units for the answer in part (a)?
(c) Kinetic energy for a molecule is $E_k = \frac{Mv^2}{2}$. Derive a general expression for the average kinetic energy of a molecule.
(d) The weight of hydrogen is 1.008 g/mol. Note that there are 6.0221415 × 10²³ molecules in 1 mole. Find the average speed of a hydrogen molecule at 300 K using the result from part (a).
(e) Use numerical integration to verify the result from part (d).
(f) Show the probability density functions for the speeds of hydrogen, helium, and oxygen on a single graph. The molecular weights for these elements are 1.008 g/mol, 4.003 g/mol, and 16.00 g/mol, respectively.

Solution: (a) E[V ] =

0

K14521_SM-Color_Cover.indd 160

∞

v·

2 π

M RT

32

M v2

v 2 e− 2RT dv

30/06/15 11:46 am

Let y 2 =

M v2 2RT

Chapter 4: Univariate Probability Distributions 2RT ⇒ v = y 2RT and dv = M M dy

155

3 3 2RT 2RT M 2 −y 2 = dy e y RT M M 0 3 3 1 2 M 2 2RT 2 2RT 2 ∞ 3 −y2 y e dy = π RT M M 0 − 32 2 2 RT 3+1 1 2RT · Γ = π M M 2 2 12 2 RT 1 · Γ(2) =4 π M 2 2RT E[V ] = 2 πM

∞

2 π

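(b) Finding units for $2\sqrt{\frac{2RT}{\pi M}}$: $R$ is in J/(mol·K); $T$ is in K; $M$ is in kg/mol; and a J is kg·m²/s². The units (ONLY) of $\sqrt{\frac{2RT}{\pi M}}$ are
\[
\sqrt{\frac{\frac{\text{J}}{\text{mol}\cdot\text{K}}\cdot\text{K}}{\frac{\text{kg}}{\text{mol}}}} = \sqrt{\frac{\frac{\text{kg}\cdot\text{m}^2}{\text{s}^2}}{\text{kg}}} = \sqrt{\frac{\text{m}^2}{\text{s}^2}} = \frac{\text{m}}{\text{s}}.
\]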

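(c)
\[
\begin{aligned}
E[E_k] &= \int_0^\infty \frac{Mv^2}{2}\cdot\sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2} v^2 e^{-\frac{Mv^2}{2RT}}\,dv \\
&= \frac{M}{2}\left(\frac{M}{RT}\right)^{3/2}\sqrt{\frac{2}{\pi}}\int_0^\infty v^4 e^{-\frac{Mv^2}{2RT}}\,dv
\end{aligned}
\]
Let $y^2 = \frac{Mv^2}{2RT}$, so that $v = y\sqrt{\frac{2RT}{M}}$ and $dv = \sqrt{\frac{2RT}{M}}\,dy$. Then
\[
\begin{aligned}
E[E_k] &= \frac{M}{2}\left(\frac{M}{RT}\right)^{3/2}\sqrt{\frac{2}{\pi}}\int_0^\infty \left(\frac{2RT}{M}\right)^{2} y^4 e^{-y^2}\sqrt{\frac{2RT}{M}}\,dy \\
&= \frac{4RT}{\sqrt{\pi}}\cdot\frac{1}{2}\Gamma\left(\frac{4+1}{2}\right) \\
&= \frac{2RT}{\sqrt{\pi}}\cdot\frac{3}{2}\cdot\frac{\sqrt{\pi}}{2} \\
&= \frac{3RT}{2}
\end{aligned}
\]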


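(d)
\[
\begin{aligned}
E[V] &= 2\sqrt{\frac{2RT}{\pi M}} \\
&= 2\sqrt{\frac{2\left(8.3145\ \frac{\text{J}}{\text{mol}\cdot\text{K}}\right)\cdot 300\ \text{K}}{\pi\cdot 1.008\ \frac{\text{g}}{\text{mol}}}} \\
&= 2\sqrt{\frac{2(8.3145)\cdot 300}{\pi\cdot 1.008}\cdot\frac{\text{J}}{\text{g}}} \\
&= 2\sqrt{\frac{2(8.3145)\cdot 300}{\pi\cdot 1.008}\cdot\frac{1000\ \text{g}\cdot\text{m}^2}{\text{s}^2\cdot\text{g}}} \\
&= 2510.259\ \frac{\text{m}}{\text{s}}
\end{aligned}
\]

(e) Numerical integration for $E[V]$ gives an answer in (J/g)$^{1/2}$. To convert to m/s, multiply by $\sqrt{1000}$.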
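The numerical check in part (e) can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the manual's own code: it works in SI units (with $M$ converted to kg/mol, so no $\sqrt{1000}$ rescaling is needed), it uses a finite upper limit of 20000 m/s because the density is negligible beyond that point, and all object names are hypothetical.

```r
# Assumptions for this sketch: SI units throughout, so M is in kg/mol
R_gas <- 8.3145      # gas constant, J/(mol K)
temp  <- 300         # absolute temperature, K
M     <- 1.008e-3    # molecular weight of hydrogen, kg/mol

# Speed density from the exercise
f <- function(v) {
  sqrt(2 / pi) * (M / (R_gas * temp))^(3 / 2) * v^2 *
    exp(-M * v^2 / (2 * R_gas * temp))
}

# Closed form from part (a)
closed_form <- 2 * sqrt(2 * R_gas * temp / (pi * M))

# Finite upper limit: the density is essentially zero well before 20000 m/s
numeric_mean <- integrate(function(v) v * f(v), lower = 0, upper = 20000)$value
total_prob   <- integrate(f, lower = 0, upper = 20000)$value

c(closed_form = closed_form, numeric = numeric_mean, total_prob = total_prob)
```

The two means should agree closely with the 2510.259 m/s found in part (d), and `total_prob` should be close to 1, which confirms that the truncated range captures essentially all of the density.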

2 λ2

− µ2 =

2 λ2

set.seed(3) X1 + + + +

laplace VY VY [1] 8.070168 > abs(VY - 8)/8*100 [1] 0.8770954

43. A tombola is a raffle in which prizes are assigned to winning tickets. In a particular tombola, only 2 tickets out of $n$ win a prize. After the two winning tickets are sold, a new tombola is started. The tickets are sold consecutively, and the prize is immediately announced when one person wins. Two friends have decided to play tombola in the following way: One of them buys the first ticket on sale, and the other one buys the first ticket after the first prize has been announced. Derive the probability that each of them wins a prize. If there are $m$ tombolas during the night in which the two friends participate, what is the probability that each of them wins more than one prize?

Solution: Let $A$ = friend one buys the first ticket and wins; then $P(A) = \frac{2}{n}$. Let $B$ = friend two buys the first ticket after the first prize is awarded and wins. Let $B_i$ = friend two buys the first ticket after the first prize is awarded and wins with the $i$th ticket, where $i > 1$. Note that $P(B) = \sum_{i=2}^{n} P(B_i)$. Now
\[
\begin{aligned}
P(B_2) &= \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_3) &= \frac{n-2}{n}\cdot\frac{2}{n-1}\cdot\frac{1}{n-2} = \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_4) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdot\frac{2}{n-2}\cdot\frac{1}{n-3} = \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_5) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdot\frac{n-4}{n-2}\cdot\frac{2}{n-3}\cdot\frac{1}{n-4} = \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_6) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdot\frac{n-4}{n-2}\cdot\frac{n-5}{n-3}\cdot\frac{2}{n-4}\cdot\frac{1}{n-5} = \frac{2}{n}\cdot\frac{1}{n-1} \\
&\ \ \vdots \\
P(B_i) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdots\frac{n-(i-1)}{n-(i-3)}\cdot\frac{2}{n-(i-2)}\cdot\frac{1}{n-(i-1)} = \frac{2}{n}\cdot\frac{1}{n-1}.
\end{aligned}
\]
Thus
\[
P(B) = \sum_{i=2}^{n} P(B_i) = \sum_{i=2}^{n}\frac{2}{n}\cdot\frac{1}{n-1} = (n-1)\cdot\frac{2}{n}\cdot\frac{1}{n-1} = \frac{2}{n}.
\]
$\therefore P(B) = P(A) = \frac{2}{n}$.

Let $X$ = the number of winning tickets purchased by friend one. Then $X \sim Bin(m, \pi = 2/n)$.


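\[
\begin{aligned}
P(X > 1) &= 1 - P(X \le 1) = 1 - P(X = 0) - P(X = 1) \\
&= 1 - \binom{m}{0}\left(\frac{2}{n}\right)^{0}\left(\frac{n-2}{n}\right)^{m} - \binom{m}{1}\left(\frac{2}{n}\right)^{1}\left(\frac{n-2}{n}\right)^{m-1} \\
&= 1 - \left(\frac{n-2}{n}\right)^{m} - m\,\frac{2}{n}\left(\frac{n-2}{n}\right)^{m-1} \\
&= 1 - \frac{(n-2)^{m-1}(n+2m-2)}{n^m}.
\end{aligned}
\]
Since $P(A) = P(B)$, the same binomial model applies to friend two, so each friend wins more than one prize with probability
\[
1 - \frac{(n-2)^{m-1}(n+2m-2)}{n^m}.
\]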
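The closed form for $P(X > 1)$ can be compared with R's binomial distribution function. The sketch below uses hypothetical values $n = 10$ tickets and $m = 5$ tombolas, chosen only for illustration.

```r
# Hypothetical illustration values (not from the exercise)
n <- 10   # tickets per tombola
m <- 5    # tombolas played during the night

# Closed form derived above
closed_form <- 1 - (n - 2)^(m - 1) * (n + 2 * m - 2) / n^m

# P(X > 1) for X ~ Bin(m, 2/n), taken from the upper tail of the binomial cdf
via_pbinom <- pbinom(1, size = m, prob = 2 / n, lower.tail = FALSE)

c(closed_form = closed_form, via_pbinom = via_pbinom)
```

With these values both expressions equal $1 - 0.8^5 - 5(0.2)(0.8)^4 = 0.26272$.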

44. Consider the World Cup Soccer data stored in the data frame SOCCER. The observed and expected number of goals for a 90-minute game were computed in the “Poisson: World Cup Soccer Example” in this chapter. To verify that the Poisson rate λ is constant, compute the observed and expected number of goals with the time intervals 45, 15, 10, 5, and 1 minute(s). Compute the means and variances for both the observed and expected counts in each time interval. Based on the results, is the probability of exactly one outcome in a sufficiently short interval proportional to the length of the interval?

Solution: There is very good agreement between the observed number of goals and the expected number of goals using a Poisson distribution regardless of the time period. Specifically, the probability of a goal in a given period is proportional to the length of the period. The function bigf() is written to answer the question.

bigf 0 and ρ = −1 when a < 0. Solution:

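\[
\begin{aligned}
\rho_{X,Y} &= \frac{\operatorname{Cov}[X, Y]}{\sigma_X\sigma_Y} = \frac{E[XY] - E[X]E[Y]}{\sigma_X\sigma_Y} \\
&= \frac{E[X(aX+b)] - E[X]\,[aE[X] + b]}{\sigma_X\,|a|\,\sigma_X} = \frac{aE[X^2] - aE[X]^2}{|a|\sigma_X^2} \\
&= \frac{a\operatorname{Cov}[X, X]}{|a|\sigma_X^2} = \frac{a\sigma_X^2}{|a|\sigma_X^2} = \frac{a}{|a|}
\end{aligned}
\]
If $a > 0$, then $\rho_{X,Y} = \frac{a}{|a|} = 1$. If $a < 0$, then $\rho_{X,Y} = \frac{a}{|a|} = -1$.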

14. Given the joint density function
\[
f(x, y) = 6x, \quad 0 < x < y < 1,
\]
find the $E[Y \mid X]$ that is the regression line resulting from regressing $Y$ on $X$.

Solution: $E[Y \mid X] = \int_{-\infty}^{\infty} y f(y \mid x)\,dy$, where $f(y \mid x) = f(x, y)/f_X(x)$.

Chapter 5:

Multivariate Probability Distributions

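\[
f_X(x) = \int_x^1 f(x, y)\,dy = \int_x^1 6x\,dy = 6xy\Big|_x^1 = 6x - 6x^2 = 6x(1 - x), \quad \text{for } 0 < x < 1.
\]
\[
f(y \mid x) = \frac{6x}{6x(1-x)} = \frac{1}{1-x} \quad \text{for } 0 < x < y < 1.
\]
This means
\[
E[Y \mid X] = \int_x^1 y\cdot\frac{1}{1-x}\,dy = \frac{y^2}{2(1-x)}\Bigg|_x^1 = \frac{1 - x^2}{2(1-x)} = \frac{1+x}{2} = \frac{1}{2} + \frac{x}{2}.
\]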

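15. A poker hand (5 cards) is dealt from a single deck of well-shuffled cards. If the random variables $X$ and $Y$ represent the number of aces and the number of kings in a hand, respectively,

(a) Write the joint distribution $p_{X,Y}(x, y)$.

(b) What is the marginal distribution of $X$, $p_X(x)$?

(c) What is the marginal distribution of $Y$, $p_Y(y)$?

Hint: $\displaystyle\sum_{x=0}^{\infty}\binom{a}{x}\binom{b}{n-x} = \binom{a+b}{n}$.

Solution:

(a)
\[
p_{X,Y}(x, y) = \frac{\binom{4}{x}\binom{4}{y}\binom{44}{5-x-y}}{\binom{52}{5}} \quad \text{for } x = 0, 1, 2, 3, 4;\ y = 0, 1, 2, 3, 4;\ 0 \le x + y \le 5.
\]

(b)
\[
p_X(x) = \sum_y f(x, y) = \sum_y \frac{\binom{4}{x}\binom{4}{y}\binom{44}{5-x-y}}{\binom{52}{5}} = \frac{\binom{4}{x}\binom{48}{5-x}}{\binom{52}{5}}\sum_y\frac{\binom{4}{y}\binom{44}{5-x-y}}{\binom{48}{5-x}} = \frac{\binom{4}{x}\binom{48}{5-x}}{\binom{52}{5}}
\]

(c)
\[
p_Y(y) = \sum_x f(x, y) = \sum_x \frac{\binom{4}{x}\binom{4}{y}\binom{44}{5-x-y}}{\binom{52}{5}} = \frac{\binom{4}{y}\binom{48}{5-y}}{\binom{52}{5}}\sum_x\frac{\binom{4}{x}\binom{44}{5-x-y}}{\binom{48}{5-y}} = \frac{\binom{4}{y}\binom{48}{5-y}}{\binom{52}{5}}
\]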
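The joint pmf and its marginal can be tabulated with `choose()`. The following is a small sketch (object names are hypothetical) that checks that the joint probabilities sum to 1 and that summing over $y$ reproduces the hypergeometric marginal for the number of aces.

```r
# Joint pmf of (aces, kings) in a 5-card hand
pxy <- function(x, y) {
  choose(4, x) * choose(4, y) * choose(44, 5 - x - y) / choose(52, 5)
}

# All (x, y) pairs in the support
grid <- expand.grid(x = 0:4, y = 0:4)
grid <- grid[grid$x + grid$y <= 5, ]
grid$p <- pxy(grid$x, grid$y)

total <- sum(grid$p)                          # should equal 1
pX <- as.numeric(tapply(grid$p, grid$x, sum)) # marginal of X, summing over y

# Closed form from part (b), which is a hypergeometric pmf
pX_closed <- choose(4, 0:4) * choose(48, 5 - 0:4) / choose(52, 5)

rbind(summed = pX, closed_form = pX_closed, dhyper = dhyper(0:4, 4, 48, 5))
```

By symmetry the same check applies to the marginal of $Y$ in part (c).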

16. If $f_{X,Y}(x, y) = 5x - y^2$ in the region bounded by $y = 0$, $x = 0$, and $y = 2 - 2x$, find the density function for the marginal distribution of $X$, for $0 < x < 1$.

Solution:
\[
\begin{aligned}
f_X(x) &= \int_0^{2-2x} (5x - y^2)\,dy \\
&= \left[5xy - \frac{y^3}{3}\right]_0^{2-2x} \\
&= 5x(2 - 2x) - \frac{(2-2x)^3}{3} \\
&= 10x - 10x^2 - \frac{8}{3}(1 - 3x + 3x^2 - x^3) \\
&= \frac{8}{3}x^3 + 18x - 18x^2 - \frac{8}{3} \quad \text{for } 0 < x < 1
\end{aligned}
\]

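17. If $f(x, y) = e^{-(x+y)}$, $x > 0$, and $y > 0$, find $P\left(X + 3 > Y \mid X > \frac{1}{3}\right)$.

Solution:
\[
P\left(X + 3 > Y \,\Big|\, X > \tfrac{1}{3}\right) = \frac{P\left(X + 3 > Y,\ X > \frac{1}{3}\right)}{P\left(X > \frac{1}{3}\right)}
\]
\[
\begin{aligned}
P\left(X + 3 > Y,\ X > \tfrac{1}{3}\right) &= \int_{1/3}^{\infty}\int_0^{x+3} e^{-(x+y)}\,dy\,dx \\
&= \int_{1/3}^{\infty}\left[-e^{-(x+y)}\right]_0^{x+3}\,dx \\
&= \int_{1/3}^{\infty}\left(-e^{-(2x+3)} + e^{-x}\right)dx \\
&= \left[\frac{e^{-(2x+3)}}{2} - e^{-x}\right]_{1/3}^{\infty} \\
&= -\frac{e^{-11/3}}{2} + e^{-1/3}
\end{aligned}
\]
\[
f_X(x) = \int_0^{\infty} e^{-(x+y)}\,dy = e^{-x}
\]
\[
P\left(X > \tfrac{1}{3}\right) = \int_{1/3}^{\infty} e^{-x}\,dx = \left[-e^{-x}\right]_{1/3}^{\infty} = e^{-1/3}
\]
\[
P\left(X + 3 > Y \,\Big|\, X > \tfrac{1}{3}\right) = \frac{P\left(X + 3 > Y,\ X > \frac{1}{3}\right)}{P\left(X > \frac{1}{3}\right)} = \frac{-\frac{e^{-11/3}}{2} + e^{-1/3}}{e^{-1/3}} = 1 - \frac{1}{2}e^{-10/3} = 0.9822
\]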
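The value 0.9822 can be checked numerically. The sketch below is one possible approach, assuming nested one-dimensional `integrate()` calls and a finite upper limit of 50 for $x$ (the mass beyond that point is of order $e^{-50}$); the helper name `inner` is hypothetical.

```r
# Inner integral over y for each x, vectorised so integrate() can call it
inner <- function(x) {
  sapply(x, function(xx) {
    integrate(function(y) exp(-(xx + y)), lower = 0, upper = xx + 3)$value
  })
}

# P(X + 3 > Y, X > 1/3); the upper limit 50 stands in for infinity
joint  <- integrate(inner, lower = 1/3, upper = 50)$value

# P(X > 1/3) for the Exp(1) marginal of X
margin <- pexp(1/3, rate = 1, lower.tail = FALSE)

c(numeric = joint / margin, closed_form = 1 - exp(-10/3) / 2)
```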

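18. If $f(x, y) = 1$, $0 < x < 1$, $0 < y < 1$, what is $P\left(Y - X > \frac{1}{2} \mid X + Y > \frac{1}{2}\right)$?

Solution:
\[
P\left(Y - X > \tfrac{1}{2} \,\Big|\, X + Y > \tfrac{1}{2}\right) = \frac{P\left(Y - X > \frac{1}{2},\ X + Y > \frac{1}{2}\right)}{P\left(X + Y > \frac{1}{2}\right)} = \frac{P\left(Y > X + \frac{1}{2}\right)}{P\left(X > \frac{1}{2} - Y\right)}
\]
The numerator simplifies because $Y - X > \frac{1}{2}$ with $X > 0$ forces $Y > \frac{1}{2}$, and hence $X + Y > \frac{1}{2}$.
\[
f_X(x) = \int_0^1 f(x, y)\,dy = \int_0^1 1\,dy = y\Big|_0^1 = 1 \quad \text{for } 0 < x < 1
\]
\[
\begin{aligned}
P\left(X > \tfrac{1}{2} - Y\right) &= 1 - P\left(X \le \tfrac{1}{2} - Y\right) \\
&= 1 - \int_0^{1/2}\int_0^{\frac{1}{2} - y} 1\,dx\,dy \\
&= 1 - \int_0^{1/2} x\Big|_0^{\frac{1}{2} - y}\,dy \\
&= 1 - \int_0^{1/2}\left(\frac{1}{2} - y\right)dy \\
&= 1 - \left[\frac{y}{2} - \frac{y^2}{2}\right]_0^{1/2} \\
&= 1 - \left(\frac{1}{4} - \frac{1}{8}\right) = 1 - \frac{1}{8} = \frac{7}{8}
\end{aligned}
\]
\[
\begin{aligned}
P\left(Y > X + \tfrac{1}{2}\right) &= \int_{1/2}^{1}\int_0^{y - \frac{1}{2}} 1\,dx\,dy \\
&= \int_{1/2}^{1} x\Big|_0^{y - \frac{1}{2}}\,dy \\
&= \int_{1/2}^{1}\left(y - \frac{1}{2}\right)dy \\
&= \left[\frac{y^2}{2} - \frac{y}{2}\right]_{1/2}^{1} \\
&= \left(\frac{1}{2} - \frac{1}{2}\right) - \left(\frac{1}{8} - \frac{1}{4}\right) = \frac{1}{8}
\end{aligned}
\]
\[
P\left(Y - X > \tfrac{1}{2} \,\Big|\, X + Y > \tfrac{1}{2}\right) = \frac{1/8}{7/8} = \frac{1}{7}
\]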
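A quick Monte Carlo check of the answer $1/7 \approx 0.1429$ is sketched below. The seed and the number of draws are arbitrary choices made for this illustration only.

```r
set.seed(1)          # arbitrary seed, for reproducibility only
N <- 1e6             # number of simulated (X, Y) pairs

x <- runif(N)        # X ~ Unif(0, 1)
y <- runif(N)        # Y ~ Unif(0, 1), independent of X

cond <- (x + y) > 0.5                        # conditioning event
est  <- mean((y - x) > 0.5 & cond) / mean(cond)

c(estimate = est, exact = 1 / 7)
```

The estimate varies slightly with the seed, but with a million draws it should agree with $1/7$ to about three decimal places.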

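19. If $f(x, y) = k(y - 2x)$ is a joint density function over $0 < x < 1$, $0 < y < 1$, and $y > x^2$, then what is the value of the constant $k$?

Solution: $k$ must have a value so that
\[
\int_0^1\int_0^1 f(x, y)\,dx\,dy = 1.
\]
\[
\begin{aligned}
\int_0^1\int_0^1 f(x, y)\,dx\,dy &= k\int_0^1\int_{x^2}^{1}(y - 2x)\,dy\,dx \\
&= k\int_0^1\left[\frac{y^2}{2} - 2xy\right]_{x^2}^{1}dx \\
&= k\int_0^1\left(\frac{1}{2} - 2x - \frac{x^4}{2} + 2x^3\right)dx \\
&= k\left[\frac{x}{2} - x^2 - \frac{x^5}{10} + \frac{x^4}{2}\right]_0^1 \\
&= k\left(\frac{1}{2} - 1 - \frac{1}{10} + \frac{1}{2}\right) \\
&= k\left(-\frac{1}{10}\right) = 1 \implies k = -10
\end{aligned}
\]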


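20. Let $X$ and $Y$ have the joint density function
\[
f(x, y) = \begin{cases}\frac{4}{3}x + \frac{2}{3}y & \text{for } 0 < x < 1 \text{ and } 0 < y < 1, \\ 0 & \text{otherwise.}\end{cases}
\]
Find $P(2X < 1 \mid X + Y < 1)$.

Solution:
\[
\begin{aligned}
P(2X < 1,\ Y < 1 - X) &= \int_0^{1/2}\int_0^{1-x}\left(\frac{4}{3}x + \frac{2}{3}y\right)dy\,dx \\
&= \int_0^{1/2}\left[\frac{4}{3}xy + \frac{1}{3}y^2\right]_0^{1-x}dx \\
&= \int_0^{1/2}\left(\frac{4}{3}(x - x^2) + \frac{1}{3}(1 - 2x + x^2)\right)dx \\
&= \int_0^{1/2}\left(-x^2 + \frac{2}{3}x + \frac{1}{3}\right)dx \\
&= \left[-\frac{x^3}{3} + \frac{x^2}{3} + \frac{x}{3}\right]_0^{1/2} \\
&= -\frac{1}{24} + \frac{1}{12} + \frac{1}{6} = \frac{5}{24}
\end{aligned}
\]
\[
\begin{aligned}
P(Y < 1 - X) &= \int_0^{1}\int_0^{1-x}\left(\frac{4}{3}x + \frac{2}{3}y\right)dy\,dx \\
&= \int_0^{1}\left[\frac{4}{3}xy + \frac{1}{3}y^2\right]_0^{1-x}dx \\
&= \int_0^{1}\left(\frac{4}{3}(x - x^2) + \frac{1}{3}(1 - 2x + x^2)\right)dx \\
&= \int_0^{1}\left(-x^2 + \frac{2}{3}x + \frac{1}{3}\right)dx \\
&= \left[-\frac{x^3}{3} + \frac{x^2}{3} + \frac{x}{3}\right]_0^{1} \\
&= -\frac{1}{3} + \frac{1}{3} + \frac{1}{3} = \frac{1}{3}
\end{aligned}
\]
So
\[
P(2X < 1 \mid X + Y < 1) = \frac{P(2X < 1,\ Y < 1 - X)}{P(Y < 1 - X)} = \frac{5/24}{1/3} = \frac{5}{8}.
\]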

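21. Let $X$ and $Y$ have the joint density function
\[
f(x, y) = \begin{cases}6(x - y)^2 & \text{for } 0 < x < 1 \text{ and } 0 < y < 1, \\ 0 & \text{otherwise.}\end{cases}
\]

(a) Find $P\left(X < \frac{1}{2} \mid Y < \frac{1}{4}\right)$.

(b) Find $P\left(X < \frac{1}{2} \mid Y = \frac{1}{4}\right)$.

Solution:

(a)
\[
\begin{aligned}
P\left(X < \tfrac{1}{2} \,\Big|\, Y < \tfrac{1}{4}\right) &= \frac{P\left(X < \frac{1}{2},\ Y < \frac{1}{4}\right)}{P\left(Y < \frac{1}{4}\right)} = \frac{\int_0^{1/2}\int_0^{1/4} 6(x - y)^2\,dy\,dx}{\int_0^{1}\int_0^{1/4} 6(x - y)^2\,dy\,dx} \\
&= \frac{\int_0^{1/2}\left[-\frac{(x - y)^3}{3}\right]_0^{1/4}dx}{\int_0^{1}\left[-\frac{(x - y)^3}{3}\right]_0^{1/4}dx} = \frac{\int_0^{1/2}\left(\frac{x^3}{3} - \frac{(x - \frac{1}{4})^3}{3}\right)dx}{\int_0^{1}\left(\frac{x^3}{3} - \frac{(x - \frac{1}{4})^3}{3}\right)dx} \\
&= \frac{\left[x^4 - \left(x - \frac{1}{4}\right)^4\right]_0^{1/2}}{\left[x^4 - \left(x - \frac{1}{4}\right)^4\right]_0^{1}} \\
&= \frac{\left(\frac{1}{2}\right)^4 - \left(\frac{1}{2} - \frac{1}{4}\right)^4 - \left[0^4 - \left(0 - \frac{1}{4}\right)^4\right]}{1^4 - \left(1 - \frac{1}{4}\right)^4 - \left[0^4 - \left(0 - \frac{1}{4}\right)^4\right]} \\
&= \frac{\frac{1}{16} - \frac{1}{256} + \frac{1}{256}}{1 - \frac{81}{256} + \frac{1}{256}} = \frac{1/16}{176/256} = \frac{1}{11}
\end{aligned}
\]

(b)
\[
\begin{aligned}
P\left(X < \tfrac{1}{2} \,\Big|\, Y = \tfrac{1}{4}\right) &= \frac{\int_0^{1/2} 6\left(x - \frac{1}{4}\right)^2 dx}{\int_0^{1} 6\left(x - \frac{1}{4}\right)^2 dx} = \frac{\left[2\left(x - \frac{1}{4}\right)^3\right]_0^{1/2}}{\left[2\left(x - \frac{1}{4}\right)^3\right]_0^{1}} \\
&= \frac{2\left[\left(\frac{1}{2} - \frac{1}{4}\right)^3 - \left(0 - \frac{1}{4}\right)^3\right]}{2\left[\left(1 - \frac{1}{4}\right)^3 - \left(0 - \frac{1}{4}\right)^3\right]} = \frac{2\left(\frac{1}{64} + \frac{1}{64}\right)}{2\left(\frac{27}{64} + \frac{1}{64}\right)} = \frac{2/64}{28/64} = \frac{1}{14}
\end{aligned}
\]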


22. Let $X$ and $Y$ denote the weight (in kilograms) and height (in centimeters), respectively, of 20-year-old American males. Assume that $X$ and $Y$ have a bivariate normal distribution with parameters $\mu_X = 82$, $\sigma_X = 9$, $\mu_Y = 190$, $\sigma_Y = 10$, and $\rho = 0.8$. Find

(a) $E[Y \mid X = 75]$,
(b) $E[Y \mid X = 90]$,
(c) $\operatorname{Var}[Y \mid X = 75]$,
(d) $\operatorname{Var}[Y \mid X = 90]$,
(e) $P(Y \ge 190 \mid X = 75)$, and
(f) $P(185 \le Y \le 195 \mid X = 90)$.

Solution: Recall that $E[Y \mid X = x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and $\operatorname{Var}[Y \mid X = x] = \sigma^2_{Y|x} = \sigma_Y^2(1 - \rho^2)$.

(a) $E[Y \mid X = 75] = 190 + 0.8\cdot\frac{10}{9}(75 - 82) = 183.7778$

> EYgX75 <- 190 + 0.8 * 10/9 * (75 - 82)
> EYgX75
[1] 183.7778

(b) $E[Y \mid X = 90] = 190 + 0.8\cdot\frac{10}{9}(90 - 82) = 197.1111$

> EYgX90 <- 190 + 0.8 * 10/9 * (90 - 82)
> EYgX90
[1] 197.1111

(c) $\operatorname{Var}[Y \mid X = 75] = 10^2(1 - 0.8^2) = 36$

> VYgX75 <- 10^2 * (1 - 0.8^2)
> VYgX75
[1] 36

(d) $\operatorname{Var}[Y \mid X = 90] = 10^2(1 - 0.8^2) = 36$

> VYgX90 <- 10^2 * (1 - 0.8^2)
> VYgX90
[1] 36

(e) $P(Y \ge 190 \mid X = 75) = 0.1499$ because $Y|_{x=75} \sim N(183.7778, \sqrt{36})$.

> 1 - pnorm(190, EYgX75, sqrt(VYgX75))
[1] 0.1498593

(f) $P(185 \le Y \le 195 \mid X = 90) = 0.3407$

> pnorm(195, EYgX90, sqrt(VYgX90)) - pnorm(185, EYgX90, sqrt(VYgX90))
[1] 0.340706
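Because the same two formulas recur in the bivariate normal exercises, they can be wrapped in a small helper. The function `cond_normal()` below is hypothetical (it does not appear in the text) and is only a sketch of how the conditional mean and variance might be packaged; it is applied here to parts (a), (c), and (e).

```r
# Hypothetical helper: conditional mean and variance of Y given X = x
# for a bivariate normal with the stated means, standard deviations, and rho
cond_normal <- function(x, muX, sdX, muY, sdY, rho) {
  list(mean = muY + rho * sdY / sdX * (x - muX),
       var  = sdY^2 * (1 - rho^2))
}

cn75 <- cond_normal(75, muX = 82, sdX = 9, muY = 190, sdY = 10, rho = 0.8)

cn75$mean    # E[Y | X = 75]
cn75$var     # Var[Y | X = 75]

# P(Y >= 190 | X = 75), using the upper tail directly
p_e <- pnorm(190, mean = cn75$mean, sd = sqrt(cn75$var), lower.tail = FALSE)
p_e
```

Note that the conditional variance does not depend on $x$, which is why parts (c) and (d) give the same value.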

23. Let $X$ and $Y$ denote the heart rate (in beats per minute) and average power output (in watts) for a 10-minute cycling time trial performed by a professional cyclist. Assume that $X$ and $Y$ have a bivariate normal distribution with parameters $\mu_X = 180$, $\sigma_X = 10$, $\mu_Y = 400$, $\sigma_Y = 50$, and $\rho = 0.9$. Find

(a) $E[Y \mid X = 170]$,
(b) $E[Y \mid X = 200]$,
(c) $\operatorname{Var}[Y \mid X = 170]$,
(d) $\operatorname{Var}[Y \mid X = 200]$,
(e) $P(Y \le 380 \mid X = 170)$, and
(f) $P(Y \ge 450 \mid X = 200)$.

Solution: Recall that $E[Y \mid X = x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and $\operatorname{Var}[Y \mid X = x] = \sigma^2_{Y|x} = \sigma_Y^2(1 - \rho^2)$.

(a) $E[Y \mid X = 170] = 400 + 0.9\cdot\frac{50}{10}(170 - 180) = 355$

> EYgX170 <- 400 + 0.9 * 50/10 * (170 - 180)
> EYgX170
[1] 355

(b) $E[Y \mid X = 200] = 400 + 0.9\cdot\frac{50}{10}(200 - 180) = 490$

> EYgX200 <- 400 + 0.9 * 50/10 * (200 - 180)
> EYgX200
[1] 490

(c) $\operatorname{Var}[Y \mid X = 170] = 50^2(1 - 0.9^2) = 475$

> VYgX170 <- 50^2*(1 - 0.9^2)
> VYgX170
[1] 475

(d) $\operatorname{Var}[Y \mid X = 200] = 50^2(1 - 0.9^2) = 475$

> VYgX200 <- 50^2*(1 - 0.9^2)
> VYgX200
[1] 475

(e) $P(Y \le 380 \mid X = 170) = 0.8743$

> pnorm(380, EYgX170, sqrt(VYgX170))
[1] 0.8743254

(f) $P(Y \ge 450 \mid X = 200) = 0.9668$

> 1 - pnorm(450, EYgX200, sqrt(VYgX200))
[1] 0.9667713

24. A certain group of college students takes both the Scholastic Aptitude Test (SAT) and an intelligence quotient (IQ) test. Let $X$ and $Y$ denote the students’ scores on the SAT and IQ tests, respectively. Assume that $X$ and $Y$ have a bivariate normal distribution with parameters $\mu_X = 980$, $\sigma_X = 126$, $\mu_Y = 117$, $\sigma_Y = 7.2$, and $\rho = 0.58$. Find

(a) $E[Y \mid X = 1350]$,
(b) $E[Y \mid X = 700]$,
(c) $\operatorname{Var}[Y \mid X = 700]$,
(d) $P(Y \le 120 \mid X = 1350)$, and
(e) $P(Y \ge 100 \mid X = 700)$.

Solution: Recall that $E[Y \mid X = x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and $\operatorname{Var}[Y \mid X = x] = \sigma^2_{Y|x} = \sigma_Y^2(1 - \rho^2)$.

(a) $E[Y \mid X = 1350] = 117 + (0.58)\frac{7.2}{126}(1350 - 980) = 129.2629$

> EYgX1350 <- 117 + 0.58 * 7.2/126 * (1350 - 980)
> EYgX1350
[1] 129.2629

(b) $E[Y \mid X = 700] = 117 + (0.58)\frac{7.2}{126}(700 - 980) = 107.72$

> EYgX700 <- 117 + 0.58 * 7.2/126 * (700 - 980)
> EYgX700
[1] 107.72

(c) $\operatorname{Var}[Y \mid X = 700] = 7.2^2(1 - 0.58^2) = 34.401$

> VYgX700 <- 7.2^2 * (1 - 0.58^2)
> VYgX700
[1] 34.40102

(d) $P(Y \le 120 \mid X = 1350) = 0.0571$. The conditional variance does not depend on $x$, so VYgX700 is reused here.

> pnorm(120, EYgX1350, sqrt(VYgX700))
[1] 0.05713586

(e) $P(Y \ge 100 \mid X = 700) = 0.906$

> 1 - pnorm(100, EYgX700, sqrt(VYgX700))
[1] 0.9059515

25. A canning industry uses tins with weight equal to 20 grams. The tin is placed on a scale and filled with red peppers until the scale shows the weight $\mu$. Then, the tin contains $Y$ grams of peppers. If the scale is subject to a random error $X \sim N(0, \sigma = 10)$,

(a) How is $Y$ related to $X$ and $\mu$?

(b) What is the probability distribution of the random variable $Y$?

(c) Calculate the value $\mu$ such that 98% of the tins contain at least 400 grams of peppers.

(d) Repeat the exercise assuming the weight of the tins to be a normal random variable $W \sim N(20, \sigma = 5)$ if $X$ and $W$ are independent.

Solution:

(a) $Y = \mu + X - 20$

(b) $E[Y] = E[\mu + X - 20] = E[\mu] + E[X] - E[20] = \mu + 0 - 20 = \mu - 20$ and $\operatorname{Var}[Y] = \operatorname{Var}[\mu + X - 20] = \operatorname{Var}[X] = 100$. Since $X \sim N(0, 10)$, it follows that $Y \sim N(\mu - 20, 10)$.

(c)
\[
\begin{aligned}
P(Y \ge 400) &= 0.98 \\
1 - P(Y \le 400) &= 0.98 \\
P(Y \le 400) &= 0.02 \\
P\left(Z \le \frac{400 - (\mu - 20)}{10}\right) &= 0.02 \implies \frac{420 - \mu}{10} = Z_{0.02} \\
\mu &= 420 - Z_{0.02}\times 10 = 440.5375
\end{aligned}
\]

> mu <- 420 - qnorm(0.02) * 10
> mu
[1] 440.5375

$\mu$ must be at least 440.5375 grams for 98% of the tins to contain at least 400 grams of peppers.

(d) $Y = \mu + X - W$, where $X \sim N(0, 10)$ and $W \sim N(20, 5)$.
\[
\begin{aligned}
E[Y] &= \mu + 0 - 20 = \mu - 20 \\
\operatorname{Var}[Y] &= \operatorname{Var}[X] + \operatorname{Var}[W] = 100 + 25 = 125 \\
Y &\sim N(\mu - 20, \sqrt{125})
\end{aligned}
\]
\[
\begin{aligned}
P(Y \ge 400) &= 0.98 \\
1 - P(Y \le 400) &= 0.98 \\
P(Y \le 400) &= 0.02 \\
P\left(Z \le \frac{400 - (\mu - 20)}{\sqrt{125}}\right) &= 0.02 \implies \frac{420 - \mu}{\sqrt{125}} = Z_{0.02} \\
\mu &= 420 - Z_{0.02}\times\sqrt{125} = 442.9616
\end{aligned}
\]

> mu <- 420 - qnorm(0.02) * sqrt(125)
> mu
[1] 442.9616

$\mu$ must be at least 442.9616 grams for 98% of the tins to contain at least 400 grams of peppers.

26. Given the joint density function $f_{X,Y}(x, y) = x + y$, $0 \le x \le 1$, $0 \le y \le 1$,

(a) Show that $f_{X,Y}(x, y) \ge 0$ for all $x$ and $y$ and that $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$.

(b) Find the cumulative distribution function.

(c) Find the marginal means of $X$ and $Y$.

(d) Find the marginal variances of $X$ and $Y$.

Solution:

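(a) By inspection $x + y$ is greater than or equal to zero in the ranges for $x$ and $y$ given. For property 2,
\[
\begin{aligned}
\int_0^1\int_0^1 (x + y)\,dx\,dy &= \int_0^1\left[\frac{x^2}{2} + xy\right]_0^1 dy \\
&= \int_0^1\left(\frac{1}{2} + y\right)dy \\
&= \left[\frac{1}{2}y + \frac{y^2}{2}\right]_0^1 \\
&= \frac{1}{2} + \frac{1}{2} = 1
\end{aligned}
\]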

(b) The cumulative distribution function must be defined in five regions: $x < 0$ or $y < 0$; $0 \le x < 1$ and $0 \le y < 1$; $x \ge 1$ and $0 \le y < 1$; $y \ge 1$ and $0 \le x < 1$; and $x, y \ge 1$.

For the region $0 \le x < 1$ and $0 \le y < 1$,
\[
F_{X,Y}(x, y) = \int_0^x\int_0^y (r + s)\,ds\,dr = \frac{x^2 y}{2} + \frac{x y^2}{2}.
\]
For the region $x \ge 1$ and $0 \le y < 1$,
\[
F_{X,Y}(x, y) = \int_0^y\int_0^1 (r + s)\,dr\,ds = \frac{y}{2} + \frac{y^2}{2}.
\]
For the region $y \ge 1$ and $0 \le x < 1$,
\[
F_{X,Y}(x, y) = \int_0^x\int_0^1 (r + s)\,ds\,dr = \frac{x}{2} + \frac{x^2}{2}.
\]
\[
F_{X,Y}(x, y) = \begin{cases}
0 & x < 0 \text{ or } y < 0 \\
\frac{x^2 y}{2} + \frac{x y^2}{2} & 0 \le x < 1,\ 0 \le y < 1 \\
\frac{y}{2} + \frac{y^2}{2} & x \ge 1,\ 0 \le y < 1 \\
\frac{x}{2} + \frac{x^2}{2} & y \ge 1,\ 0 \le x < 1 \\
1 & x, y \ge 1
\end{cases}
\]

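(c) To find the marginal means and variances, the marginal densities of $X$ and $Y$ must be calculated.
\[
f_X(x) = \int_0^1 (x + y)\,dy = \left[xy + \frac{y^2}{2}\right]_0^1 = x + \frac{1}{2} \quad \text{for } 0 \le x \le 1
\]
Similarly, $f_Y(y) = y + \frac{1}{2}$ for $0 \le y \le 1$.
\[
\begin{aligned}
E[X] &= \int_0^1 x\left(x + \tfrac{1}{2}\right)dx = \left[\frac{x^3}{3} + \frac{x^2}{4}\right]_0^1 = \frac{7}{12} \\
E[Y] &= \int_0^1 y\left(y + \tfrac{1}{2}\right)dy = \left[\frac{y^3}{3} + \frac{y^2}{4}\right]_0^1 = \frac{7}{12}
\end{aligned}
\]

(d) To calculate the variance, $E[X^2]$ and $E[Y^2]$ must be found:
\[
\begin{aligned}
E[X^2] &= \int_0^1 x^2\left(x + \tfrac{1}{2}\right)dx = \left[\frac{x^4}{4} + \frac{x^3}{6}\right]_0^1 = \frac{5}{12} \\
E[Y^2] &= \int_0^1 y^2\left(y + \tfrac{1}{2}\right)dy = \left[\frac{y^4}{4} + \frac{y^3}{6}\right]_0^1 = \frac{5}{12} \\
\operatorname{Var}[X] &= E[X^2] - (E[X])^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{60 - 49}{144} = \frac{11}{144} \\
\operatorname{Var}[Y] &= E[Y^2] - (E[Y])^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{60 - 49}{144} = \frac{11}{144}
\end{aligned}
\]

27. The lifetimes of two electronic components are random variables, $X$ and $Y$. Their joint density function is given by
\[
f_{X,Y}(x, y) = \frac{1 + x + y + cxy}{c + 3}\exp(-(x + y)), \quad x \ge 0 \text{ and } y \ge 0.
\]

(a) Verify that $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1$.

(b) Find $f_X(x)$.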

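(c) What value of $c$ makes $X$ and $Y$ independent?

Solution: Note that
\[
\int_0^\infty e^{-x}\,dx = -e^{-x}\Big|_0^\infty = 1 \quad\text{and}\quad \int_0^\infty x e^{-x}\,dx = -x e^{-x}\Big|_0^\infty + \int_0^\infty e^{-x}\,dx = 1.
\]

(a)
\[
\begin{aligned}
1 &\overset{?}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy \\
&= \int_0^\infty\int_0^\infty \frac{1 + x + y + cxy}{c + 3}\exp(-(x + y))\,dx\,dy \\
&= \frac{1}{c+3}\Bigg[\int_0^\infty\int_0^\infty e^{-(x+y)}\,dx\,dy + \int_0^\infty\int_0^\infty x e^{-(x+y)}\,dx\,dy \\
&\qquad\qquad + \int_0^\infty\int_0^\infty y e^{-(x+y)}\,dx\,dy + c\int_0^\infty\int_0^\infty xy\,e^{-(x+y)}\,dx\,dy\Bigg] \\
&= \frac{1}{c+3}\Bigg[\int_0^\infty e^{-y}\int_0^\infty e^{-x}\,dx\,dy + \int_0^\infty e^{-y}\int_0^\infty x e^{-x}\,dx\,dy \\
&\qquad\qquad + \int_0^\infty e^{-x}\int_0^\infty y e^{-y}\,dy\,dx + c\int_0^\infty y e^{-y}\int_0^\infty x e^{-x}\,dx\,dy\Bigg] \\
&= \frac{1}{c+3}[1 + 1 + 1 + c] \\
1 &= 1
\end{aligned}
\]

(b)
\[
\begin{aligned}
f_X(x) &= \int_0^\infty f_{X,Y}(x, y)\,dy \\
&= \int_0^\infty \frac{1 + x + y + cxy}{c + 3}\exp(-(x + y))\,dy \\
&= \frac{1}{c+3}\Bigg[\int_0^\infty e^{-(x+y)}\,dy + \int_0^\infty x e^{-(x+y)}\,dy + \int_0^\infty y e^{-(x+y)}\,dy + \int_0^\infty cxy\,e^{-(x+y)}\,dy\Bigg] \\
&= \frac{1}{c+3}\left[e^{-x} + x e^{-x} + e^{-x} + cx e^{-x}\right] \\
f_X(x) &= \frac{e^{-x}[2 + x + cx]}{c+3} \quad \text{for } x \ge 0
\end{aligned}
\]

(c) If $X$ and $Y$ are to be independent, $f_X(x)\cdot f_Y(y) = f_{X,Y}(x, y)$. Setting the two sides equal,
\[
\begin{aligned}
\frac{e^{-x}[2 + x + cx]}{c+3}\cdot\frac{e^{-y}[2 + y + cy]}{c+3} &= \frac{1 + x + y + cxy}{c+3}\,e^{-x}e^{-y} \\
[2 + x + cx]\cdot[2 + y + cy] &= (1 + x + y + cxy)(c + 3) \\
4 + 2y + 2cy + 2x + xy + cxy + 2cx + cxy + c^2xy &= c + cx + cy + c^2xy + 3 + 3x + 3y + 3cxy \\
1 - y - x + xy &= c(1 - x - y + xy) \\
\therefore\ c &= 1 \quad \forall (x, y).
\end{aligned}
\]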
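The factorization can be spot-checked numerically. In the sketch below (function names and evaluation points are hypothetical, chosen only for illustration), the joint density is compared with the product of the marginals at a few points, once for $c = 1$ and once for $c = 2$.

```r
# Joint density from the exercise; the constant c is passed as `cc`
fxy <- function(x, y, cc) {
  (1 + x + y + cc * x * y) / (cc + 3) * exp(-(x + y))
}

# Marginal density from part (b); by symmetry it also serves for Y
fmarg <- function(x, cc) exp(-x) * (2 + x + cc * x) / (cc + 3)

# Arbitrary evaluation points (an assumption of this sketch)
pts <- expand.grid(x = c(0.2, 1, 3.5), y = c(0.1, 2, 4))

# Largest absolute gap between the joint density and the product of marginals
gap <- function(cc) {
  max(abs(fxy(pts$x, pts$y, cc) - fmarg(pts$x, cc) * fmarg(pts$y, cc)))
}

c(gap_c1 = gap(1), gap_c2 = gap(2))
```

The gap vanishes (up to rounding) for $c = 1$ and is clearly nonzero for $c = 2$, in line with the algebra above.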


If $c = 1$, $X$ and $Y$ are independent.

28. Given the joint continuous pdf
\[
f_{X,Y}(x, y) = \begin{cases}1 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \text{ and} \\ 0 & \text{otherwise}\end{cases}
\]
and using the function adaptIntegrate() from the package cubature,

(a) find $F_{X,Y}(x = 0.6, y = 0.8)$.

(b) find $P(0.25 \le X \le 0.75, 0.1 \le Y \le 0.9)$.

(c) find $f_X(x)$.

Solution:

(a) $F_{X,Y}(x = 0.6, y = 0.8) = \int_0^{0.6}\int_0^{0.8} f_{X,Y}(r, s)\,ds\,dr = 0.48$

> library(cubature) # load
> f <- function(x) { 1 }  # constant joint density on the unit square
> adaptIntegrate(f, lowerLimit = c(0, 0), upperLimit = c(0.8, 0.6))
$integral
[1] 0.48

$error
[1] 0

$functionEvaluations
[1] 17

$returnCode
[1] 0

(b) $P(0.25 \le X \le 0.75, 0.1 \le Y \le 0.9) = \int_{0.25}^{0.75}\int_{0.1}^{0.9} f_{X,Y}(r, s)\,ds\,dr = 0.4$

> adaptIntegrate(f, lowerLimit = c(0.10, 0.25), upperLimit = c(0.9, 0.75))
$integral
[1] 0.4

$error
[1] 0

$functionEvaluations
[1] 17

$returnCode
[1] 0

(c) $f_X(x) = \int_0^1 f_{X,Y}(x, y)\,dy = 1$

> adaptIntegrate(f, lowerLimit = 0, upperLimit = 1)
$integral
[1] 1

$error
[1] 1.110223e-14

$functionEvaluations
[1] 15

$returnCode
[1] 0

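29. Let $X$ and $Y$ have the joint density function
\[
f_{X,Y}(x, y) = \begin{cases}Kxy & 2 \le x \le 4 \text{ and } 4 \le y \le 6 \text{ and} \\ 0 & \text{otherwise.}\end{cases}
\]

(a) Find $K$ so that the given function is a valid pdf.

(b) Find the marginal densities of $X$ and $Y$.

(c) Are $X$ and $Y$ independent? Justify.

Solution:

(a)
\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_4^6\int_2^4 Kxy\,dx\,dy \\
&= \int_4^6\left[\frac{Kx^2 y}{2}\right]_2^4 dy \\
&= \int_4^6\left(\frac{16Ky}{2} - \frac{4Ky}{2}\right)dy \\
&= \int_4^6 6Ky\,dy \\
&= 3Ky^2\Big|_4^6 \\
&= 3K(36 - 16) \\
1 &= 60K \implies K = \frac{1}{60}
\end{aligned}
\]

> f <- function(x) { x[1] * x[2] }  # integrand xy, before scaling by K
> K <- 1 / adaptIntegrate(f, lowerLimit = c(2, 4), upperLimit = c(4, 6))$integral
> MASS::fractions(K)
[1] 1/60

(b)
\[
\begin{aligned}
f_X(x) &= \int_4^6 \frac{1}{60}xy\,dy = \left[\frac{xy^2}{120}\right]_4^6 = \frac{x}{120}[36 - 16] = \frac{x}{6} \quad \text{for } 2 \le x \le 4 \\
f_Y(y) &= \int_2^4 \frac{1}{60}xy\,dx = \left[\frac{x^2 y}{120}\right]_2^4 = \frac{y}{120}[16 - 4] = \frac{y}{10} \quad \text{for } 4 \le y \le 6
\end{aligned}
\]

(c) $X$ and $Y$ are independent since $f_X(x)\cdot f_Y(y) = \frac{x}{6}\cdot\frac{y}{10} = \frac{xy}{60} = f_{X,Y}(x, y)$.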
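The constant $K$ and the factorization in part (c) can also be checked with base R alone. The sketch below assumes nested one-dimensional `integrate()` calls in place of `adaptIntegrate()`; the helper names are hypothetical.

```r
# Integral of xy over 2 <= x <= 4, 4 <= y <= 6 via nested integrate() calls
inner <- function(y) {
  sapply(y, function(yy) {
    integrate(function(x) x * yy, lower = 2, upper = 4)$value
  })
}
total <- integrate(inner, lower = 4, upper = 6)$value

K <- 1 / total          # normalising constant

# Marginal densities from part (b)
fX <- function(x) x / 6
fY <- function(y) y / 10

c(K = K, one_over_60 = 1 / 60)

# Independence check at an arbitrary interior point (x, y) = (3, 5)
c(product = fX(3) * fY(5), joint = K * 3 * 5)
```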

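30. Given the joint density function of $X$ and $Y$
\[
f_{X,Y}(x, y) = \begin{cases}1/2 & x + y \le 2,\ x \ge 0,\ y \ge 0 \text{ and} \\ 0 & \text{otherwise.}\end{cases}
\]

(a) Find the marginal densities of $X$ and $Y$.

(b) Find $E[X]$, $E[Y]$, $\operatorname{Cov}[X, Y]$, and $\rho_{X,Y}$.

(c) Find $P\left(X + Y < 1 \mid X > \frac{1}{2}\right)$.

Solution:

(a) The marginal of $X$ is
\[
f_X(x) = \int_0^{2-x}\frac{1}{2}\,dy = \frac{y}{2}\Big|_0^{2-x} = \frac{2 - x}{2} = 1 - \frac{x}{2} \quad \text{for } 0 \le x \le 2,
\]
and the marginal of $Y$ is
\[
f_Y(y) = \int_0^{2-y}\frac{1}{2}\,dx = \frac{x}{2}\Big|_0^{2-y} = \frac{2 - y}{2} = 1 - \frac{y}{2} \quad \text{for } 0 \le y \le 2.
\]

(b)
\[
\begin{aligned}
E[X] &= \int_0^2 x f_X(x)\,dx = \int_0^2\left(x - \frac{x^2}{2}\right)dx = \left[\frac{x^2}{2} - \frac{x^3}{6}\right]_0^2 = 2 - \frac{4}{3} = \frac{2}{3} \\
E[Y] &= \int_0^2 y f_Y(y)\,dy = \int_0^2\left(y - \frac{y^2}{2}\right)dy = \left[\frac{y^2}{2} - \frac{y^3}{6}\right]_0^2 = 2 - \frac{4}{3} = \frac{2}{3} \\
E[XY] &= \int_0^2\int_0^{2-y} xy f_{X,Y}(x, y)\,dx\,dy = \int_0^2\int_0^{2-y}\frac{xy}{2}\,dx\,dy = \int_0^2\frac{y}{2}\cdot\frac{x^2}{2}\Big|_0^{2-y}dy \\
&= \int_0^2\frac{y}{4}(4 - 4y + y^2)\,dy = \left[\frac{y^2}{2} - \frac{y^3}{3} + \frac{y^4}{16}\right]_0^2 = 2 - \frac{8}{3} + 1 = \frac{1}{3} \\
\operatorname{Cov}[X, Y] &= E[XY] - E[X]\cdot E[Y] = \frac{1}{3} - \frac{2}{3}\cdot\frac{2}{3} = \frac{-1}{9} \\
E[X^2] &= \int_0^2 x^2 f_X(x)\,dx = \int_0^2\left(x^2 - \frac{x^3}{2}\right)dx = \left[\frac{x^3}{3} - \frac{x^4}{8}\right]_0^2 = \frac{8}{3} - 2 = \frac{2}{3} \\
E[Y^2] &= \int_0^2 y^2 f_Y(y)\,dy = \int_0^2\left(y^2 - \frac{y^3}{2}\right)dy = \left[\frac{y^3}{3} - \frac{y^4}{8}\right]_0^2 = \frac{8}{3} - 2 = \frac{2}{3} \\
\operatorname{Var}[X] &= E[X^2] - (E[X])^2 = \frac{2}{3} - \left(\frac{2}{3}\right)^2 = \frac{2}{9} \\
\operatorname{Var}[Y] &= E[Y^2] - (E[Y])^2 = \frac{2}{3} - \left(\frac{2}{3}\right)^2 = \frac{2}{9} \\
\rho_{X,Y} &= \frac{\operatorname{Cov}[X, Y]}{\sigma_X\cdot\sigma_Y} = \frac{-1/9}{\sqrt{2/9}\cdot\sqrt{2/9}} = \frac{-1}{2}
\end{aligned}
\]

(c)
\[
\begin{aligned}
P\left(X + Y < 1 \,\Big|\, X > \tfrac{1}{2}\right) &= \frac{P\left(X + Y < 1,\ X > \frac{1}{2}\right)}{P\left(X > \frac{1}{2}\right)} \\
&= \frac{\int_{1/2}^{1}\int_0^{1-x}\frac{1}{2}\,dy\,dx}{\int_{1/2}^{2}\left(1 - \frac{x}{2}\right)dx} \\
&= \frac{\int_{1/2}^{1}\frac{1 - x}{2}\,dx}{\left[x - \frac{x^2}{4}\right]_{1/2}^{2}} \\
&= \frac{\left[\frac{x}{2} - \frac{x^2}{4}\right]_{1/2}^{1}}{\left[x - \frac{x^2}{4}\right]_{1/2}^{2}} \\
&= \frac{\left(\frac{1}{2} - \frac{1}{4}\right) - \left(\frac{1}{4} - \frac{1}{16}\right)}{(2 - 1) - \left(\frac{1}{2} - \frac{1}{16}\right)} \\
&= \frac{\frac{1}{4} - \frac{3}{16}}{1 - \frac{7}{16}} = \frac{1}{9}
\end{aligned}
\]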

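31. Let $X$ and $Y$ have the joint density function
\[
f_{X,Y}(x, y) = \begin{cases}Ky & -2 \le x \le 2,\ 1 \le y \le x^2 \text{ and} \\ 0 & \text{otherwise.}\end{cases}
\]

(a) Find $K$ so that $f_{X,Y}(x, y)$ is a valid pdf.

(b) Find the marginal densities of $X$ and $Y$.

(c) Find $P\left(Y > \frac{3}{2} \mid X < \frac{1}{2}\right)$.

Solution:

(a) Note that $-2 \le x \le 2$ and $1 \le y \le x^2$ implies that $1 \le x^2$, which means that $(x \le -1) \cup (x \ge 1)$ is implied in the ranges of the variables.
\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy \\
&= \int_{-2}^{-1}\int_1^{x^2} Ky\,dy\,dx + \int_1^2\int_1^{x^2} Ky\,dy\,dx \\
&= K\left[\int_{-2}^{-1}\frac{x^4 - 1}{2}\,dx + \int_1^2\frac{x^4 - 1}{2}\,dx\right] \\
&= K\left\{\left[\frac{x^5}{10} - \frac{x}{2}\right]_{-2}^{-1} + \left[\frac{x^5}{10} - \frac{x}{2}\right]_1^2\right\} \\
&= K\left\{\left(\frac{-1}{10} + \frac{1}{2}\right) - \left(\frac{-32}{10} + 1\right) + \left(\frac{32}{10} - 1\right) - \left(\frac{1}{10} - \frac{1}{2}\right)\right\} \\
&= K\left(\frac{4}{10} + \frac{22}{10} + \frac{22}{10} + \frac{4}{10}\right) \\
1 &= K\,\frac{26}{5} \implies K = \frac{5}{26}
\end{aligned}
\]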

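(b)
\[
\begin{aligned}
f_X(x) &= \int f_{X,Y}(x, y)\,dy = \int_1^{x^2}\frac{5}{26}y\,dy = \frac{5}{52}y^2\Big|_1^{x^2} = \frac{5}{52}(x^4 - 1) \\
\implies f_X(x) &= \begin{cases}\frac{5}{52}(x^4 - 1), & (-2 \le x \le -1) \cup (1 \le x \le 2) \\ 0, & \text{otherwise}\end{cases}
\end{aligned}
\]
\[
\begin{aligned}
f_Y(y) &= \int f_{X,Y}(x, y)\,dx = \int_{-2}^{-\sqrt{y}}\frac{5}{26}y\,dx + \int_{\sqrt{y}}^{2}\frac{5}{26}y\,dx \\
&= \frac{5}{26}xy\Big|_{-2}^{-\sqrt{y}} + \frac{5}{26}xy\Big|_{\sqrt{y}}^{2} \\
&= \frac{5}{26}\left(-y^{3/2} + 2y + 2y - y^{3/2}\right) \\
&= \frac{5}{26}\left(4y - 2y^{3/2}\right) \\
\implies f_Y(y) &= \begin{cases}\frac{5}{13}(2y - y^{3/2}), & 1 \le y \le 4 \\ 0, & \text{otherwise}\end{cases}
\end{aligned}
\]

(c)
\[
\begin{aligned}
P\left(Y > \tfrac{3}{2} \,\Big|\, X < \tfrac{1}{2}\right) &= \frac{P\left(Y > \frac{3}{2},\ X < \frac{1}{2}\right)}{P\left(X < \frac{1}{2}\right)} \\
&= \frac{\int_{-2}^{-\sqrt{3/2}}\int_{3/2}^{x^2}\frac{5}{26}y\,dy\,dx}{\int_{-2}^{-1}\frac{5}{52}(x^4 - 1)\,dx} \\
&= \frac{\int_{-2}^{-\sqrt{3/2}}\left[\frac{y^2}{2}\right]_{3/2}^{x^2}dx}{\frac{1}{2}\left[\frac{x^5}{5} - x\right]_{-2}^{-1}} \\
&= \frac{\int_{-2}^{-\sqrt{3/2}}\left(x^4 - \frac{9}{4}\right)dx}{\left(\frac{-1}{5} + 1\right) - \left(\frac{-32}{5} + 2\right)} \\
&= \frac{\left[\frac{x^5}{5} - \frac{9}{4}x\right]_{-2}^{-\sqrt{3/2}}}{\frac{4}{5} + \frac{22}{5}} \\
&= \frac{-\frac{9}{4}\cdot\frac{\sqrt{3/2}}{5} + \frac{9}{4}\sqrt{\frac{3}{2}} - \left(\frac{-32}{5} + \frac{9}{2}\right)}{\frac{26}{5}} \\
&= \frac{\frac{36}{20}\sqrt{\frac{3}{2}} + \frac{19}{10}}{\frac{26}{5}} \\
&= \frac{9\sqrt{6} + 19}{52} = 0.7893
\end{aligned}
\]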
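Both the normalising constant and the conditional probability can be verified numerically. The following is a sketch that assumes nested `integrate()` calls; the helper `inner()` and its argument `ymin` are hypothetical names introduced for this illustration.

```r
K <- 5 / 26

# Integral of K*y over y from ymin to x^2, vectorised in x
inner <- function(x, ymin) {
  sapply(x, function(xx) {
    integrate(function(y) K * y, lower = ymin, upper = xx^2)$value
  })
}

# Total probability: the support has two symmetric pieces, [-2, -1] and [1, 2]
total <- 2 * integrate(inner, lower = 1, upper = 2, ymin = 1)$value

# P(Y > 3/2, X < 1/2): only x in [-2, -sqrt(3/2)] has x^2 > 3/2 on the left piece
num <- integrate(inner, lower = -2, upper = -sqrt(3 / 2), ymin = 3 / 2)$value

# P(X < 1/2): the whole left piece
den <- integrate(inner, lower = -2, upper = -1, ymin = 1)$value

c(total = total, conditional = num / den,
  closed_form = (9 * sqrt(6) + 19) / 52)
```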

32. An engineer has designed a new diesel motor that is used in a prototype earth mover. The prototype’s diesel consumption in gallons per mile $C$ follows the equation $C = 3 + 2X + \frac{3}{2}Y$, where $X$ is a speed coefficient and $Y$ is the quality diesel coefficient. Suppose the joint density for $X$ and $Y$ is $f_{X,Y}(x, y) = ky$, $0 \le x \le 2$, $0 \le y \le x$.

(a) Find $k$ so that $f_{X,Y}(x, y)$ is a valid density function.

(b) Are $X$ and $Y$ independent?

(c) Find the mean diesel consumption for the prototype.

Solution:

(a)
\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy \\
&= \int_0^2\int_0^x ky\,dy\,dx \\
&= \int_0^2\frac{ky^2}{2}\Big|_0^x dx \\
&= \int_0^2\frac{kx^2}{2}\,dx \\
&= \frac{kx^3}{6}\Big|_0^2 \\
1 &= \frac{8k}{6} \implies k = \frac{3}{4}
\end{aligned}
\]

(b) For $X$ and $Y$ to be independent, $f_X(x)\cdot f_Y(y) = f_{X,Y}(x, y)$.
\[
\begin{aligned}
f_X(x) &= \int_0^x\frac{3}{4}y\,dy = \frac{3}{8}y^2\Big|_0^x = \frac{3}{8}x^2 \quad \text{for } 0 \le x \le 2 \\
f_Y(y) &= \int_y^2\frac{3}{4}y\,dx = \frac{3}{4}xy\Big|_y^2 = \frac{3}{4}(2y - y^2) \quad \text{for } 0 \le y \le 2
\end{aligned}
\]
$f_X(x)\cdot f_Y(y) = \frac{9}{32}x^2(2y - y^2) \ne \frac{3}{4}y = f_{X,Y}(x, y)$, so $X$ and $Y$ are not independent.

(c)
\[
\begin{aligned}
E[C] &= E\left[3 + 2X + \tfrac{3}{2}Y\right] = 3 + 2E[X] + \tfrac{3}{2}E[Y] \\
E[X] &= \int_0^2 x\cdot f_X(x)\,dx = \int_0^2 x\cdot\frac{3}{8}x^2\,dx = \frac{3x^4}{32}\Big|_0^2 = \frac{3}{2} \\
E[Y] &= \int_0^2 y\cdot\frac{3}{4}(2y - y^2)\,dy = \frac{3}{4}\left[\frac{2y^3}{3} - \frac{y^4}{4}\right]_0^2 = \frac{3}{4}\left(\frac{16}{3} - 4\right) = 1 \\
E[C] &= 3 + 2E[X] + \tfrac{3}{2}E[Y] = 3 + 2\cdot\frac{3}{2} + \frac{3}{2}\cdot 1 = 7.5
\end{aligned}
\]
The average diesel consumption is 7.5 gallons/mile.

33. To make porcelain, kaolin $X$ and feldspar $Y$ are needed to create a soft mixture that later becomes hard. The proportion of these components for every ton of porcelain has the density function $f_{X,Y}(x, y) = Kx^2y$, $0 \le x \le y \le 1$, $x + y \le 1$.

(a) Find the value of $K$ so that $f_{X,Y}(x, y)$ is a valid pdf.

(b) Find the marginal densities of $X$ and $Y$.

(c) Find the kaolin mean and the feldspar mean by ton.

(d) Find the probability that the proportion of feldspar will be higher than $\frac{1}{3}$, if the kaolin is more than half of the porcelain.

Solution:

(a) Set the double integral of the density equal to 1:

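\[
\begin{aligned}
1 &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy \\
&= \int_0^1\int_0^{1-y} kx^2 y\,dx\,dy \\
&= \int_0^1 ky\cdot\frac{x^3}{3}\Big|_0^{1-y}dy \\
&= \frac{k}{3}\int_0^1\left(y - 3y^2 + 3y^3 - y^4\right)dy \\
&= \frac{k}{3}\left[\frac{y^2}{2} - y^3 + \frac{3}{4}y^4 - \frac{y^5}{5}\right]_0^1 \\
&= \frac{k}{3}\left(\frac{1}{2} - 1 + \frac{3}{4} - \frac{1}{5}\right) \\
1 &= \frac{k}{60} \implies k = 60
\end{aligned}
\]

(b)
\[
\begin{aligned}
f_X(x) &= \int_0^{1-x} 60x^2 y\,dy = 30x^2 y^2\Big|_0^{1-x} = 30x^2(1 - x)^2 \quad \text{for } 0 \le x \le 1 \\
f_Y(y) &= \int_0^{1-y} 60x^2 y\,dx = 20x^3 y\Big|_0^{1-y} = 20y(1 - y)^3 \quad \text{for } 0 \le y \le 1
\end{aligned}
\]

(c)
\[
\begin{aligned}
E[X] &= \int_0^1 x\cdot 30x^2(1 - x)^2\,dx = \int_0^1\left(30x^3 - 60x^4 + 30x^5\right)dx \\
&= \left[\frac{15x^4}{2} - 12x^5 + 5x^6\right]_0^1 = \frac{15}{2} - 12 + 5 = \frac{1}{2} \\
E[Y] &= \int_0^1 y\cdot 20y(1 - y)^3\,dy = \int_0^1 20y^2(1 - 3y + 3y^2 - y^3)\,dy \\
&= \left[\frac{20y^3}{3} - 15y^4 + 12y^5 - \frac{10y^6}{3}\right]_0^1 = \frac{20}{3} - 15 + 12 - \frac{10}{3} = \frac{1}{3}
\end{aligned}
\]

(d)
\[
\begin{aligned}
P\left(Y > \tfrac{1}{3} \,\Big|\, X > \tfrac{1}{2}\right) &= \frac{P\left(Y > \frac{1}{3},\ X > \frac{1}{2}\right)}{P\left(X > \frac{1}{2}\right)} \\
&= \frac{\int_{1/3}^{1/2}\int_{1/2}^{1-y} 60x^2 y\,dx\,dy}{\int_{1/2}^{1}\left(30x^2 - 60x^3 + 30x^4\right)dx} \\
&= \frac{\int_{1/3}^{1/2}\int_{1/2}^{1-y} 2x^2 y\,dx\,dy}{\int_{1/2}^{1}\left(x^2 - 2x^3 + x^4\right)dx} \\
&= \frac{\int_{1/3}^{1/2} 2y\left[\frac{x^3}{3}\right]_{1/2}^{1-y}dy}{\left[\frac{x^3}{3} - \frac{x^4}{2} + \frac{x^5}{5}\right]_{1/2}^{1}} \\
&= \frac{\int_{1/3}^{1/2}\frac{2y}{3}\left(1 - 3y + 3y^2 - y^3 - \frac{1}{8}\right)dy}{\left(\frac{1}{3} - \frac{1}{2} + \frac{1}{5}\right) - \left(\frac{1}{24} - \frac{1}{32} + \frac{1}{160}\right)} \\
&= \frac{\frac{1}{3}\int_{1/3}^{1/2}\left(\frac{7}{4}y - 6y^2 + 6y^3 - 2y^4\right)dy}{\frac{1}{30} - \frac{1}{60}} \\
&= 20\left[\frac{7y^2}{8} - 2y^3 + \frac{3y^4}{2} - \frac{2y^5}{5}\right]_{1/3}^{1/2} \\
&= 20\left[\left(\frac{7}{32} - \frac{1}{4} + \frac{3}{32} - \frac{1}{80}\right) - \left(\frac{7}{72} - \frac{2}{27} + \frac{1}{54} - \frac{2}{1215}\right)\right] \\
&= 20\left[\frac{1}{20} - \frac{389}{9720}\right] = \frac{97}{486} = 0.1996
\end{aligned}
\]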


34. A device can fail in four different ways with probabilities $\pi_1 = 0.2$, $\pi_2 = 0.1$, $\pi_3 = 0.4$, and $\pi_4 = 0.3$. Suppose there are 12 devices that fail independently of one another. What is the probability of 3 failures of the first kind, 4 of the second, 3 of the third, and 2 of the fourth?

Solution:
\[
P(x_1 = 3, x_2 = 4, x_3 = 3, x_4 = 2) = \frac{12!}{3!\cdot 4!\cdot 3!\cdot 2!}(0.2)^3(0.1)^4(0.4)^3(0.3)^2 = 0.0013
\]

> dmultinom(x=c(3, 4, 3, 2), size=12, prob=c(0.2, 0.1, 0.4, 0.3))
[1] 0.001277338

35. The wait time in minutes a shopper spends in a local supermarket’s checkout line has distribution
\[
f(x) = \frac{\exp(-x/2)}{2}, \quad x > 0.
\]
On weekends, however, the wait is longer, and the distribution then is given by
\[
g(x) = \frac{\exp(-x/3)}{3}, \quad x > 0.
\]
Find

(a) The probability that the waiting time for a customer will be less than 1 minute.

(b) The probability that, given a waiting time of less than 2 minutes, it will be a weekend.

(c) The probability that the customer waits less than 2 minutes.

Solution:

(a) Assume that the day a person shops is uniformly distributed across the week, and let $X$ = wait time in minutes that a shopper spends in a local supermarket’s checkout line and $Y$ = weekend indicator, where 0 is a weekday and 1 is a weekend. Then
\[
\begin{aligned}
P(X < 1) &= P(X < 1 \cap Y = 0) + P(X < 1 \cap Y = 1) \\
&= P(X < 1 \mid Y = 0)P(Y = 0) + P(X < 1 \mid Y = 1)P(Y = 1) \\
&= \frac{5}{7}\int_0^1\frac{e^{-x/2}}{2}\,dx + \frac{2}{7}\int_0^1\frac{e^{-x/3}}{3}\,dx \\
&= \frac{5}{7}\left(1 - e^{-1/2}\right) + \frac{2}{7}\left(1 - e^{-1/3}\right) = 0.362
\end{aligned}
\]

> (1 - exp(-1/2))*5/7 + (1 - exp(-1/3))*2/7
[1] 0.3620406
> # or
> pexp(1, 1/2)*5/7 + pexp(1, 1/3)*2/7
[1] 0.3620406

(b) $P(Y = 1 \mid X < 2) = 0.2354$

> ((1 - exp(-2/3))*2/7) / ((1 - exp(-1))*5/7 + (1 - exp(-2/3))*2/7)
[1] 0.2354185

(c) $P(X < 2) = 0.5905$. Same as (a) with integrals from 0 to 2 rather than 1.

> (1 - exp(-1))*5/7 + (1 - exp(-2/3))*2/7
[1] 0.5905384
> # or
> pexp(2, 1/2)*5/7 + pexp(2, 1/3)*2/7
[1] 0.5905384

36. An engineering team has designed a lamp with two light bulbs. Let $X$ be the lifetime for bulb 1 and $Y$ the lifetime for bulb 2, both in thousands of hours. Suppose that $X$ and $Y$ are independent and they follow an $Exp(\lambda = 1)$ distribution.

(a) Find the joint density function of $X$ and $Y$. What is the probability neither bulb lasts longer than 1000 hours?

(b) If the lamp works when at least one bulb is lit, what is the probability that the lamp works no more than 2000 hours?

(c) What is the probability that the lamp works between 1000 and 2000 hours?

Solution: Note that the distribution function of an exponential with $\lambda = 1$ is $F_X(x) = 1 - e^{-x}$.

(a) $f_{X,Y}(x, y) = f_X(x)\cdot f_Y(y) = e^{-x}\cdot e^{-y} = e^{-(x+y)}$ for $x \ge 0$, $y \ge 0$ because $X$ and $Y$ are both distributed as $Exp(\lambda = 1)$ and are independent.
\[
P(X < 1, Y < 1) = F_X(1)\cdot F_Y(1) = \left(1 - e^{-1}\right)^2 = 0.3996
\]

> (1 - exp(-1))^2
[1] 0.3995764
> # or
> pexp(1, 1)*pexp(1, 1)
[1] 0.3995764

(b) For the lamp to stop working within 2000 hours, both bulbs must die within 2000 hours.
\[
P\big((X < 2) \cap (Y < 2)\big) = P(X < 2)\cdot P(Y < 2) = \left(1 - e^{-2}\right)^2 = 0.7476
\]

> (1 - exp(-2)) * (1 - exp(-2))
[1] 0.7476451
> # or
> pexp(2, 1) * pexp(2, 1)
[1] 0.7476451

(c) The probability that the lamp works between 1000 and 2000 hours is
\[
\left(1 - e^{-2}\right)^2 - \left(1 - e^{-1}\right)^2 = 0.3481.
\]

> (1 - exp(-2))^2 - (1 - exp(-1))^2
[1] 0.3480687
> # or
> pexp(2, 1)^2 - pexp(1, 1)^2
[1] 0.3480687

37. The national weather service has issued a severe weather advisory for a particular county that indicates that severe thunderstorms will occur between 9 p.m. and 10 p.m. When the rain starts, the county places a call to the maintenance supervisor who opens the sluice gate to avoid flooding. Assuming the rain’s start time is uniformly distributed between 9 p.m. and 10 p.m.,

(a) at what time, on the average, will the county maintenance supervisor open the sluice gate?

(b) What is the probability that the sluice gate will be opened before 9:30 p.m.?

Note: Solve this problem both by hand and using R.

Solution:

(a) Let $X$ = time maintenance supervisor opens the sluice gate. Then $X \sim Unif(9, 10)$ and $f_X(x) = \frac{1}{10 - 9}$ for $9 \le x \le 10$.
\[
E[X] = \int_9^{10} x\cdot f_X(x)\,dx = \int_9^{10}\frac{x}{10 - 9}\,dx = \frac{x^2}{2}\Big|_9^{10} = \frac{100 - 81}{2} = 9.5
\]

> fx <- function(x) { x / (10 - 9) }  # integrand x * f_X(x)
> integrate(fx, lower = 9, upper = 10)$value
[1] 9.5

On the average, the sluice gate opens at 9:30 p.m.

(b) $P(X < 9{:}30) = P(X < 9.5) = \int_9^{9.5} 1\,dx = x\big|_9^{9.5} = 0.5$

> fx <- function(x) { 1 / (10 - 9) }  # density f_X(x), constant on [9, 10]
> integrate(Vectorize(fx), lower=9, upper=9.5)$value
[1] 0.5

The probability that the sluice gate opens before 9:30 p.m. is 0.5.

38. Assume the distribution of grades for a particular group of students has a bivariate normal distribution with parameters $\mu_X = 3.2$, $\mu_Y = 2.4$, $\sigma_X = 0.4$, $\sigma_Y = 0.6$, and $\rho = 0.6$, where $X$ and $Y$ represent the grade point averages in high school and the first year of college, respectively.

(a) Set the seed equal to 194 (set.seed(194)), and use the function mvrnorm() from the MASS package to simulate the population, assuming the population of interest consists of 200 students. (Hint: Use empirical = TRUE.)

(b) Compute the means of $X$ and $Y$. Are they equal to 3.2 and 2.4, respectively?

(c) Compute the variance of $X$ and $Y$ as well as the covariance between $X$ and $Y$. Are the values 0.16, 0.36, and 0.144, respectively?

(d) Create a scatterplot of $Y$ versus $X$. If a different seed value is used, how do the simulated numbers differ?

Solution:
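A minimal sketch of the simulation asked for in parts (a) through (c) of Exercise 38 is given below. It assumes the object names `mu` and `Sigma`, which are hypothetical (only `values` and `CovXY` appear in the solution itself), and builds the covariance from $\rho\sigma_X\sigma_Y = 0.6 \times 0.4 \times 0.6 = 0.144$.

```r
library(MASS)   # provides mvrnorm()

set.seed(194)

mu    <- c(3.2, 2.4)                 # means of X and Y
CovXY <- 0.6 * 0.4 * 0.6             # rho * sigma_X * sigma_Y = 0.144
Sigma <- matrix(c(0.4^2, CovXY,
                  CovXY, 0.6^2), nrow = 2)

# empirical = TRUE forces the sample moments to match mu and Sigma exactly
values <- mvrnorm(n = 200, mu = mu, Sigma = Sigma, empirical = TRUE)

colMeans(values)                     # part (b): sample means of X and Y
c(var(values[, 1]), var(values[, 2]), cov(values[, 1], values[, 2]))  # part (c)
```

With `empirical = TRUE`, the sample means, variances, and covariance reproduce the specified parameters up to floating-point error regardless of the seed.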

CovXY cov(values) [,1] [,2] [1,] 0.160 0.144 [2,] 0.144 0.360
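The simulation for parts (a) through (c) of Problem 38 might look like the following sketch. The object names `mu` and `Sigma` are assumptions chosen for illustration (only `CovXY` and `values` survive in the printed solution), and the exact calls used in the original are not known. With `empirical = TRUE`, `mvrnorm()` rescales the draws so that the sample mean vector and covariance matrix match the requested parameters.

```r
library(MASS)  # provides mvrnorm()

set.seed(194)
mu    <- c(3.2, 2.4)                 # means of X and Y
CovXY <- 0.6 * 0.4 * 0.6             # rho * sigma_X * sigma_Y = 0.144
Sigma <- matrix(c(0.4^2, CovXY,
                  CovXY, 0.6^2), nrow = 2)

# empirical = TRUE forces the sample moments to equal mu and Sigma
# (up to floating-point error), whatever the seed
values <- mvrnorm(n = 200, mu = mu, Sigma = Sigma, empirical = TRUE)

colMeans(values)   # (b) sample means of X and Y
cov(values)        # (c) variances on the diagonal, covariance off it
```

Changing the seed changes the individual rows of `values` but not these summary statistics.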


(d) When a different seed value is used, different values are created. However, with the argument empirical = TRUE, the new values will still have means of 3.2 and 2.4 and standard deviations of 0.4 and 0.6 for X and Y, respectively. Further, the covariance between X and Y will remain 0.144.

> plot(values[,1], values[,2], xlab = "X", ylab = "Y", col = "blue",
+      pch = 19, cex = 0.5)

[Figure: scatterplot of Y versus X.]

39. Show that if $X_1, X_2, \ldots, X_n$ are independent random variables with means $\mu_1, \mu_2, \ldots, \mu_n$ and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, respectively, then the mean and variance of $Y = \sum_{i=1}^n c_i X_i$, where the $c_i$s are real-valued constants, are $\mu_Y = \sum_{i=1}^n c_i \mu_i$ and $\sigma_Y^2 = \sum_{i=1}^n c_i^2 \sigma_i^2$. (Hint: Use moment generating functions.)

Solution: Solving directly:

$$E[Y] = E\left[\sum_{i=1}^n c_i X_i\right] = \sum_{i=1}^n c_i E[X_i] = \sum_{i=1}^n c_i \mu_i$$

$$\begin{aligned}
Var[Y] &= E\left[(Y - E[Y])^2\right] \\
&= E\left[\left(\sum_{i=1}^n c_i X_i - E\left[\sum_{i=1}^n c_i X_i\right]\right)^2\right] \\
&= E\left[\left(\sum_{i=1}^n c_i (X_i - \mu_i)\right)^2\right] \\
&= \sum_{i=1}^n c_i^2 E\left[(X_i - \mu_i)^2\right] + 2\sum_{i<j} c_i c_j E\left[(X_i - \mu_i)(X_j - \mu_j)\right] \\
&= \sum_{i=1}^n c_i^2 \sigma_i^2,
\end{aligned}$$

since independence makes every cross term $E[(X_i - \mu_i)(X_j - \mu_j)]$, $i < j$, equal to zero.

Chapter 6: Sampling and Sampling Distributions

> pt(3, 5)
[1] 0.9849504

(b) $P(2 < X < 3) = 0.0359$

> pt(3, 5) - pt(2, 5)
[1] 0.03592012

(c) If $P(X < a) = 0.05$, $a = -2.015$.

> qt(0.05, 5)
[1] -2.015048

3. If $(1 - 2t)^{-5}$, $t < \frac{1}{2}$, is the mgf of a random variable X, find $P(X < 15.99)$.

Solution: $M_X(t) = (1 - 2t)^{-5} \implies X \sim \chi^2_{10}$, so $P(X < 15.99) = 0.9001$.

> pchisq(15.99, 10)
[1] 0.900081

4. If $X \sim \chi^2_{10}$, find the constants a and b so that $P(a < X < b) = 0.90$ and $P(X < a) = 0.05$.

Solution: $X \sim \chi^2_{10}$, $P(a < X < b) = 0.90$, and $P(X < a) = 0.05 \implies a = 3.9403$ and $b = 18.307$.

> a <- qchisq(0.05, 10)
> b <- qchisq(0.95, 10)
> c(a, b)
[1]  3.940299 18.307038

5. Let X be a $\chi^2_{10}$. Calculate $P(X < 8)$ and $P(X > 6)$. Calculate a so that $P(X < a) = 0.15$. What are the population mean and population variance of X?

Solution: $P(X < 8) = 0.3712$ and $P(X > 6) = 0.8153$. If $P(X < a) = 0.15$, then $a = 5.5701$. The population mean and population variance of a $\chi^2_{10}$ random variable are 10 and 20, respectively.

> pchisq(8, 10)
[1] 0.3711631
> pchisq(6, 10, lower = FALSE)
[1] 0.8152632
> qchisq(0.15, 10)
[1] 5.570059

6. Let X be distributed as an $F_{2,5}$. Calculate $P(X < 1)$ and the median of X. Calculate a so that $P(X < a) = 0.10$. What are the population mean and population variance of X?

Solution: $P(X < 1) = 0.5688$; the median of X is 0.7988; and $a = 0.1076$ if $P(X < a) = 0.10$. The population mean and population variance of an $F_{2,5}$ random variable are 1.6667 and 13.8889, respectively.

> pf(1, 2, 5)
[1] 0.5687988
> qf(0.50, 2, 5)
[1] 0.7987698
> qf(0.10, 2, 5)
[1] 0.1076122
> EX <- 5/(5 - 2)
> VX <- 2*5^2*(2 + 5 - 2)/(2*(5 - 2)^2*(5 - 4))
> c(EX, VX)
[1]  1.666667 13.888889

7. Assume a population with 5 elements: $X_1 = 0$, $X_2 = 1$, $X_3 = 2$, $X_4 = 3$, $X_5 = 4$.

(a) Calculate $\mu$ and $\sigma^2$.
(b) Calculate the sampling distribution of the mean for random samples of size 3 taken without replacement. Verify that the mean of $\bar{X}$ is 2 and that the variance of $\bar{X}$ is $\frac{\sigma^2}{6}$.
(c) Calculate the sampling distribution of $\bar{X}$ for random samples of size 3 taken with replacement. Verify that the mean of $\bar{X}$ is 2 and that the variance of $\bar{X}$ is $\frac{\sigma^2}{n}$.

Solution:
(a) The population mean and population variance are 2 and 2, respectively.
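Parts (b) and (c) of Problem 7 can be checked by brute-force enumeration. The sketch below is one possible approach, not the original solution code; the names `xbar_srs` and `xbar_rs` are assumptions. It lists every sample of size 3 without replacement with `combn()` and every ordered sample with replacement with `expand.grid()`, then computes the mean and variance of each sampling distribution directly.

```r
pop <- 0:4
n   <- 3
mu     <- mean(pop)
sigma2 <- mean((pop - mu)^2)          # population variance (divide by N)

# (b) all choose(5, 3) = 10 equally likely samples without replacement
xbar_srs <- colMeans(combn(pop, n))

# (c) all 5^3 = 125 equally likely ordered samples with replacement
xbar_rs <- rowMeans(expand.grid(pop, pop, pop))

# mean and variance of each sampling distribution
c(mean(xbar_srs), mean((xbar_srs - mu)^2))   # 2 and sigma2/6
c(mean(xbar_rs),  mean((xbar_rs  - mu)^2))   # 2 and sigma2/n
```

The factor 1/6 without replacement is $\frac{1}{n}\cdot\frac{N-n}{N-1} = \frac{1}{3}\cdot\frac{2}{4}$, the finite population correction applied to $\sigma^2/n$.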

> Sd
[1] 4.147288

(e) The mean of the sample mean is 8 both when sampling without replacement and when sampling with replacement.

(a) $P\left(\frac{\bar{X}}{S} > 2\right)$ and (b) $P\left(\left|\frac{\bar{X}}{S_u}\right| \le 4\right)$.

Solution: Note that $\frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$.

(a) $P\left(\frac{\bar{X}}{S} > 2\right) = P\left(t_5 > 2\sqrt{6}\right) = 0.0022$

> 1 - pt(sqrt(6)*2, 5)
[1] 0.002239216

(b) $\frac{\bar{X} - 0}{S/\sqrt{n}} = \frac{\bar{X}}{\frac{S_u \sqrt{n}}{\sqrt{n-1}\,\sqrt{n}}} = \frac{\bar{X}}{S_u/\sqrt{n-1}} \sim t_{n-1}$.

$$P\left(\left|\frac{\bar{X}}{S_u}\right| \le 4\right) = P\left(-4\sqrt{5} \le \frac{\bar{X}}{S_u}\sqrt{5} \le 4\sqrt{5}\right) = P\left(-4\sqrt{5} \le t_5 \le 4\sqrt{5}\right) = 0.9997.$$

> pt(4*sqrt(5), 5) - pt(-4*sqrt(5), 5)
[1] 0.9997089

12. Constant velocity joints (CV joints) allow a rotating shaft to transmit power through a variable angle, at constant rotational speed, without an appreciable increase in friction or play. An after-market company produces CV joints. To optimize energy transfer, the drive shaft must be very precise. The company has two different branches that produce CV joints where the variability of the drive shaft is known to be 2 mm. A sample of $n_1 = 10$ is drawn from the first branch, and a sample of $n_2 = 15$ is drawn from the second branch. Suppose that the diameter follows a normal distribution. What is the probability that the drive shafts coming from the first branch will have greater variability than those of the second branch?

Solution:

$$P\left(\frac{s_1^2}{s_2^2} > 1\right) = P(F_{9,14} > 1) = 0.4823$$

> 1 - pf(1, 9, 14)
[1] 0.4823316

13. Given a population $N(\mu, \sigma)$ with unknown mean and variance, a sample of size 11 is drawn and the sample variance $S^2$ is calculated. Calculate the probability $P\left(0.5 < \frac{S^2}{\sigma^2} < 1.2\right)$.

Solution:

$$\begin{aligned}
P\left(0.5 < \frac{S^2}{\sigma^2} < 1.2\right) &= P\left(10(0.5) \le 10\frac{S^2}{\sigma^2} \le 10(1.2)\right) \\
&= P(5 \le \chi^2_{10} \le 12) \\
&= P(\chi^2_{10} \le 12) - P(\chi^2_{10} \le 5) = 0.6061
\end{aligned}$$

> pchisq(12, 10) - pchisq(5, 10)
[1] 0.6061215

14. The vendor in charge of servicing coffee dispensers is adjusting the one located in the department of statistics. To maximize profit, adjustments are made so that the average quantity of liquid dispensed per serving is 200 milliliters per cup. Suppose the amount of liquid per cup follows a normal distribution and 5.5% of the cups contain more than 224 milliliters.

(a) Find the probability that a given cup contains between 176 and 224 milliliters.
(b) If the machine can hold 20 liters of liquid, find the probability that the machine must be replenished before dispensing 99 cups.
(c) If 6 random samples of 5 cups are drawn, what is the probability that the sample mean is greater than 210 milliliters in at least 2 of them?

Solution: Let X represent the amount of coffee dispensed in a cup. Then $X \sim N(200, \sigma)$. Given that $P(X > 224) = 0.055$, $\sigma = 15.017$:

$$\begin{aligned}
P(X > 224) &= 0.055 \\
P\left(\frac{X - 200}{\sigma} > \frac{224 - 200}{\sigma}\right) &= 0.055 \\
P\left(Z > \frac{24}{\sigma}\right) &= 0.055 \\
1 - P\left(Z \le \frac{24}{\sigma}\right) &= 0.055 \\
P\left(Z \le \frac{24}{\sigma}\right) &= 0.945 \\
\implies z_{0.945} &= \frac{24}{\sigma} \\
\sigma = \frac{24}{z_{0.945}} &= \frac{24}{1.598} = 15.017
\end{aligned}$$

> sigma <- 24/qnorm(0.945)
> sigma
[1] 15.01696

(a) So, $X \sim N(200, 15.017)$. $P(176 < X < 224) = P(X < 224) - P(X < 176) = 0.89$

> pnorm(224, 200, sigma) - pnorm(176, 200, sigma)  # P(176 < X < 224)
[1] 0.89

(b) Let Y represent the amount of liquid dispensed in 99 cups. That is, $Y = \sum_i X_i$ where $X_i \sim N(200, \sigma)$. It follows then that $Y \sim N\left(99 \cdot 200, \sqrt{99 \cdot \sigma^2}\right)$. $P(Y \ge 20000) = 1 - P(Y \le 20000) = 0.0904$.

> 1 - pnorm(20000, 99*200, sqrt(99*sigma^2))
[1] 0.0903607

(c) Let W be the number of times the sample mean exceeds 210 out of 6 random samples each of size 5. The probability the sample mean exceeds 210 is $P(\bar{X} > 210) = 1 - P(\bar{X} \le 210) = 0.0682$, where $\bar{X} \sim N(200, \sigma/\sqrt{5})$. $P(W \ge 2) = 1 - P(W \le 1) = 0.0581$.

> p <- 1 - pnorm(210, 200, sigma/sqrt(5))
> p
[1] 0.06823993
> 1 - pbinom(1, 6, p)
[1] 0.05808024

15. The pill weight for a particular type of vitamin follows a normal distribution with a mean of 0.6 grams and a standard deviation of 0.015 grams. It is known that a particular therapy consisting of a box of vitamins with 125 pills is not effective if more than 20% of the pills are under 0.58 grams.

(a) Find the probability that the therapy with a box of vitamins is not effective.
(b) A supplement manufacturer sells vitamin bottles containing 125 vitamins per bottle with 50 bottles per box with a guarantee that at least 47 bottles per box weigh more than 74.7 grams. Find the probability that a randomly chosen box does not meet the guaranteed weight.

Solution:
(a) Let X = weight of a particular vitamin. $X \sim N(0.6, 0.015)$. The therapy is not effective if more than $0.20 \times 125 = 25$ pills are under 0.58 grams. The probability of a vitamin being underweight is $P(X \le 0.58) = p \implies p = 0.0912$. Let W = number of underweight pills. $W \sim Bin(125, p)$. The probability the therapy is not effective is $P(W > 25) = 1 - P(W \le 25) \approx 5.5 \times 10^{-5}$, which is 0.0001 when rounded to four decimal places.

> p <- pnorm(0.58, 0.6, 0.015)
> p
[1] 0.09121122
> 1 - pbinom(25, 125, p)
[1] 5.515032e-05

(b) Let Y = weight of a bottle of vitamins. $Y \sim N\left(125 \times 0.6, \sqrt{125 \times 0.015^2}\right)$. That is, $Y \sim N(75, 0.1677)$. $P(Y > 74.7) = 1 - P(Y \le 74.7) = 0.9632$.

> ans <- 1 - pnorm(74.7, 75, sqrt(125*0.015^2))  # P(Y > 74.7)
> ans
[1] 0.9631809

Let V = number of bottles that weigh in excess of 74.7 grams. $V \sim Bin(50, 0.9632)$ and $P(V \le 46) = 0.1117$. In other words, the probability a randomly selected box does not meet the manufacturer's guarantee is 0.1117.

> pbinom(46, 50, ans)  # P(V <= 46)

> set.seed(78)

$P(\,\cdot > 0.45) = P(t_{33} > -1.55115) = 0.9348$

> tobs
[1] -1.55115
> 1 - pt(tobs, 33)
[1] 0.9347978

19. Plot the density function of an $F_{4,6}$ random variable. Find the area to the left of x = 3 and shade this region in the original plot.

Solution:

> curve(df(x, 4, 6), from = 0, to = 9, lwd = 2, col = "red", ylab = "", xlab = "")

[Figure: density of an F(4, 6) random variable on (0, 9) with the region to the left of x = 3 shaded; plot annotation: P(F4,6 < 3) = 0.8888889.]
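The shading step of Problem 19 can be done with base graphics. The following is a minimal sketch under stated assumptions, not the original solution code: `shade_f_area` is a hypothetical helper name, and the fill color is arbitrary. It draws the density with `curve()`, fills the region under the curve from 0 to q with `polygon()`, and returns the shaded area computed with `pf()`.

```r
# Hypothetical helper: plot the F(df1, df2) density and shade P(F < q)
shade_f_area <- function(q, df1, df2, to = 9) {
  curve(df(x, df1, df2), from = 0, to = to, lwd = 2, col = "red",
        xlab = "", ylab = "")
  xs <- seq(0, q, length.out = 200)
  # close the polygon along the x-axis at both ends of the shaded region
  polygon(c(0, xs, q), c(0, df(xs, df1, df2), 0), col = "pink", border = NA)
  area <- pf(q, df1, df2)
  text(x = 0.6 * to, y = 0.3, labels = paste("P(F < q) =", round(area, 7)))
  invisible(area)
}

# area to the left of x = 3 under the F(4, 6) density
area <- pf(3, 4, 6)
area
```

Calling `shade_f_area(3, 4, 6)` produces the shaded plot; the area equals 8/9, which matches the annotation 0.8888889 in the figure.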

Similar graph with ggplot:

$P(Y > 0) = 1 - P(Y \le 0) = 0.0019$.

$P(Z > 149.8) = 0.9772$.

> 1 - pnorm(149.8, 150, 0.1)  # P(Z > 149.8)
[1] 0.9772499

23. Consider a random sample of size n from an exponential distribution with parameter $\lambda$. Use moment generating functions to show that the sample mean follows a $\Gamma(n, \lambda n)$. Graph the theoretical sampling distribution of $\bar{X}$ when sampling from an $Exp(\lambda = 1)$ for n = 30, 100, 300, and 500. Superimpose an appropriate normal density for each $\Gamma(n, \lambda n)$. At what sample size do the sampling distribution and superimposed density virtually coincide?

Solution: For $X \sim Exp(\lambda)$, $M_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-1}$. Also, the moment generating function of a $Y \sim \Gamma(n, \lambda n)$ is $M_Y(t) = \left(1 - \frac{t}{\lambda n}\right)^{-n}$.

Since $\bar{X} = \sum_{i=1}^n \frac{X_i}{n}$,

$$M_{\bar{X}}(t) = E\left[e^{t\bar{X}}\right] = E\left[e^{t\left(\sum_{i=1}^n \frac{X_i}{n}\right)}\right] = E\left[\prod_{i=1}^n e^{t \cdot \frac{X_i}{n}}\right].$$

Because the $X_i$s are independent and identically distributed,

$$M_{\bar{X}}(t) = \prod_{i=1}^n E\left[e^{\frac{t}{n} X_i}\right] = \prod_{i=1}^n M_{X_i}\left(\frac{t}{n}\right) = \prod_{i=1}^n \left(1 - \frac{t}{n\lambda}\right)^{-1} = \left(1 - \frac{t}{n\lambda}\right)^{-n} = M_Y(t).$$

Note that the sampling distributions of the sample mean are $\Gamma(30, 30)$, $\Gamma(100, 100)$, $\Gamma(300, 300)$, and $\Gamma(500, 500)$ for the sample sizes n = 30, 100, 300, and 500, respectively. The normal distributions superimposed over the gamma distributions are $N(1, \sqrt{1/30})$, $N(1, \sqrt{1/100})$, $N(1, \sqrt{1/300})$, and $N(1, \sqrt{1/500})$, respectively, since the mean of the gamma is $\alpha/\lambda$ and the variance is $\alpha/\lambda^2$.

Code for the graph with Γ(30, 30):

> curve(dgamma(x, 30, 30), from = 1 - 3.5*sqrt(1/30),
+       to = 1 + 3.5*sqrt(1/30), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/30)), from = 1 - 3.5*sqrt(1/30),
+       to = 1 + 3.5*sqrt(1/30), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(30, 30))),
+        expression(N(list(1, sqrt(1/30))))),
+        text.col = c("blue", "red"), bg = "gray92", cex = 0.90)

[Figure: Γ(30, 30) density (solid blue) with the N(1, √(1/30)) density (dashed red) superimposed.]

Code for the graph with Γ(100, 100):

> curve(dgamma(x, 100, 100), from = 1 - 3.5*sqrt(1/100),
+       to = 1 + 3.5*sqrt(1/100), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/100)), from = 1 - 3.5*sqrt(1/100),
+       to = 1 + 3.5*sqrt(1/100), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(100, 100))),
+        expression(N(list(1, sqrt(1/100))))),
+        text.col = c("blue", "red"), bg = "gray92", cex = 0.90)

[Figure: Γ(100, 100) density (solid blue) with the N(1, √(1/100)) density (dashed red) superimposed.]

Code for the graph with Γ(300, 300):

> curve(dgamma(x, 300, 300), from = 1 - 3.5*sqrt(1/300),
+       to = 1 + 3.5*sqrt(1/300), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/300)), from = 1 - 3.5*sqrt(1/300),
+       to = 1 + 3.5*sqrt(1/300), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(300, 300))),
+        expression(N(list(1, sqrt(1/300))))),
+        text.col = c("blue", "red"), bg = "gray92", cex = 0.90)

[Figure: Γ(300, 300) density (solid blue) with the N(1, √(1/300)) density (dashed red) superimposed.]

Code for the graph with Γ(500, 500):

> curve(dgamma(x, 500, 500), from = 1 - 3.5*sqrt(1/500),
+       to = 1 + 3.5*sqrt(1/500), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/500)), from = 1 - 3.5*sqrt(1/500),
+       to = 1 + 3.5*sqrt(1/500), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(500, 500))),
+        expression(N(list(1, sqrt(1/500))))),
+        text.col = c("blue", "red"), bg = "gray92", cex = 0.90)

[Figure: Γ(500, 500) density (solid blue) with the N(1, √(1/500)) density (dashed red) superimposed.]
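For Problem 23, the visual comparison of the four graphs can be supplemented with a number. The sketch below is an illustrative addition, not part of the original solution; `cdf_gap` is a hypothetical helper. It measures the largest vertical gap between the $\Gamma(n, n)$ and $N(1, \sqrt{1/n})$ cumulative distribution functions on a grid, a quantity that should shrink as n grows.

```r
# Largest gap between the Gamma(n, n) and N(1, sqrt(1/n)) CDFs,
# evaluated on a grid spanning 4 standard deviations around the mean
cdf_gap <- function(n) {
  x <- seq(1 - 4 / sqrt(n), 1 + 4 / sqrt(n), length.out = 2001)
  max(abs(pgamma(x, shape = n, rate = n) - pnorm(x, 1, sqrt(1 / n))))
}

n    <- c(30, 100, 300, 500)
gaps <- sapply(n, cdf_gap)
names(gaps) <- n
gaps
```

The CDF gap is used instead of the gap between densities because the densities themselves grow taller as n increases, which would mask the improvement.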

24. Set the seed equal to 10, and simulate 20,000 random samples of size $n_x = 65$ from a $N(4, \sigma_x = \sqrt{2})$, 20,000 random samples of size $n_y = 90$ from a $N(5, \sigma_y = \sqrt{3})$, and verify that the simulated statistic

$$\frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2}$$

follows an $F_{64,89}$ distribution.

Solution: There is close agreement between the pink simulated distribution and the blue theoretical $F_{64,89}$ distribution.
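A simulation along the lines requested in Problem 24 could be sketched as follows. This is an assumption-laden reconstruction rather than the original code: the object names are illustrative, the draws will not reproduce the original figure even with the same seed unless the random numbers are consumed in the same order, and the agreement is checked here with quantiles instead of a histogram.

```r
set.seed(10)
m  <- 20000                      # number of simulated samples
nx <- 65; ny <- 90
sx <- sqrt(2); sy <- sqrt(3)     # population standard deviations

# sample variances of m samples from each population
s2x <- replicate(m, var(rnorm(nx, mean = 4, sd = sx)))
s2y <- replicate(m, var(rnorm(ny, mean = 5, sd = sy)))
ratio <- (s2x / sx^2) / (s2y / sy^2)

# compare simulated and theoretical F(64, 89) quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
rbind(simulated   = quantile(ratio, probs),
      theoretical = qf(probs, nx - 1, ny - 1))
```

A density histogram of `ratio` with `curve(df(x, 64, 89), add = TRUE)` drawn on top gives the visual version of the same check.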

> set.seed(10)

> set.seed(95)

> set.seed(368)

> set.seed(48)

> set.seed(28)

> set.seed(37)
> PEb
[1] 0.2664411
> PTb
[1] 0.2648906
> PDb
[1] 0.585345

The empirical probability P(F < 2|F > 1.5) = 0.2664, while the theoretical probability P(F1,20 < 2|F1,20 > 1.5) = 0.2649. The percent difference between the empirical and theoretical answers is 0.5853%.

30. Verify empirically that

$$\frac{N(0, 1)}{\left(\frac{1}{5}\chi^2_5\right)^{1/2}} \sim t_5$$

by setting the seed equal to 36 and generating a sample of size 20,000 from a $N(0, 1)$ distribution. Generate another sample of size 20,000 from a $\chi^2_5$ distribution. Perform the appropriate arithmetic to arrive at the simulated sampling distribution. Create a density histogram of the results and superimpose a theoretical $t_5$ density.

Solution:
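The arithmetic described in Problem 30 might be carried out as in the sketch below. The object names (`z`, `w`, `tsim`) are assumptions, and the original solution code is not reproduced here. The check compares simulated quantiles with those of a $t_5$ distribution.

```r
set.seed(36)
m <- 20000
z <- rnorm(m)              # N(0, 1) sample
w <- rchisq(m, df = 5)     # chi-square sample with 5 degrees of freedom

# standard normal divided by the square root of (chi-square / df)
tsim <- z / sqrt(w / 5)

probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
rbind(simulated   = quantile(tsim, probs),
      theoretical = qt(probs, 5))
```

For the plot requested in the problem, `hist(tsim, freq = FALSE, breaks = "Scott")` followed by `curve(dt(x, 5), add = TRUE)` superimposes the theoretical density on the density histogram.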

> set.seed(36)

$$P(C_1 \mid W > 1975) = \frac{P(W > 1975 \mid C_1)\,P(C_1)}{\sum_{i=1}^3 P(W > 1975 \mid C_i)\,P(C_i)} = \frac{0.0022}{0.2631} = 0.0084$$

> num
[1] 0.002221388
> den
[1] 0.2630832
> PC1givnW <- num/den
> PC1givnW
[1] 0.008443672

32. 15.3% of the Spanish Internet domain names are ".org". If a sample of 2000 Spanish domain names is taken,

(a) Calculate the exact probability that at least 300 domain names will be ".org".
(b) Compute an approximate answer that at least 300 domain names will be ".org" with a normal approximation.

Solution:
(a) Let X = number of Spanish Internet domain names that are ".org". Then $X \sim Bin(2000, 0.153)$ and $P(X \ge 300) = 1 - P(X \le 299) = 0.6546$.

> 1 - pbinom(299, 2000, 0.153)
[1] 0.6545762

(b) Approximately, $X \sim N\left(2000(0.153), \sqrt{2000(0.153)(1 - 0.153)}\right)$ and $P(X \ge 300) = 1 - P(X \le 300) = 0.6453$.

> 1 - pnorm(300, 2000*0.153, sqrt(2000*0.153*(1 - 0.153)))
[1] 0.6453108
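The normal approximation in Problem 32(b) does not use a continuity correction. As an illustrative aside that is not part of the original solution, the sketch below compares the exact binomial answer, the uncorrected approximation, and the approximation with the usual half-unit continuity correction (evaluating the normal CDF at 299.5).

```r
n <- 2000
p <- 0.153
mu    <- n * p
sigma <- sqrt(n * p * (1 - p))

exact     <- 1 - pbinom(299, n, p)          # P(X >= 300), exact
plain     <- 1 - pnorm(300, mu, sigma)      # normal approximation from part (b)
corrected <- 1 - pnorm(299.5, mu, sigma)    # with continuity correction

c(exact = exact, plain = plain, corrected = corrected)
```

The corrected value lies closer to the exact probability than the uncorrected one, which is the usual motivation for the correction when approximating a discrete distribution.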

33. Set the seed equal to 86, and simulate $m_1 = 20{,}000$ samples of size $n_1 = 1000$ from a $Bin(n_1, \pi = 0.3)$ and $m_2 = 20{,}000$ samples of size $n_2 = 1100$ from a $Bin(n_2, \pi = 0.7)$. Verify that the difference of sampling proportions follows a normal distribution.

Solution: Based on the graph, the shape of the sampling distribution of $p_1 - p_2$ is clearly approximately normal. Further, the mean and standard deviation of the simulated sampling distribution are -0.4001 and 0.02, which compare favorably with the theoretical answers of -0.4 and 0.02.
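One way to carry out the simulation in Problem 33 is sketched below. The object names are assumptions, and because the original code is not available, the simulated mean and standard deviation may differ in the later decimal places from the values quoted above.

```r
set.seed(86)
m  <- 20000
n1 <- 1000; n2 <- 1100

# sample proportions from each binomial population
p1 <- rbinom(m, size = n1, prob = 0.3) / n1
p2 <- rbinom(m, size = n2, prob = 0.7) / n2
d  <- p1 - p2

# simulated versus theoretical mean and standard deviation of p1 - p2
c(mean = mean(d), sd = sd(d))
c(mean = 0.3 - 0.7, sd = sqrt(0.3 * 0.7 / n1 + 0.7 * 0.3 / n2))
```

A density histogram of `d` with the matching normal density drawn over it (`curve(dnorm(x, -0.4, 0.02), add = TRUE)`) shows the approximate normality.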

> set.seed(86)

> set.seed(679)

> plot(lambda, eff(rep(50, length(lambda))), type = "l",
+      xlab = expression(lambda), ylab = expression(eff(T[2],T[1])),
+      ylim = c(0.995, 1.005))
> abline(h = 1, lty = "dashed")

[Figure: eff(T2, T1) versus λ, with a dashed reference line at 1.]

(b) $T_2$ is more efficient than $T_1$ regardless of $\lambda$ for any positive n. For n = 50, $eff[T_2, T_1] = 1.0004$.

(c) Code to graph $eff[T_2, T_1]$ for various n:

> plot(n, eff(n), type = "l", xlab = expression(n),
+      ylab = expression(eff(T[2],T[1])), main = "Relative Efficiency",
+      col = "red")
> abline(h = 1, lty = "dashed")

[Figure: "Relative Efficiency": eff(T2, T1) versus n, with a dashed reference line at 1.]

(d) $T_2$ is moderately more efficient than $T_1$ for small sample sizes.

(e) Regardless of the value of $\lambda$ in a $\Gamma(2, \lambda)$ distribution, the estimator $T_2$ is more efficient than $T_1$. The improvement in efficiency is only marginal and only for relatively small samples.

7. Consider a random variable $X \sim Exp(\lambda)$ and two estimators of $\frac{1}{\lambda}$, the expected value of X:

$$T_1 = \bar{X} \quad \text{and} \quad T_2 = \frac{\sum_{i=1}^n X_i + 1}{n + 2}.$$

(a) Derive an expression for the relative efficiency of $T_2$ with respect to $T_1$.
(b) Plot $eff(T_2, T_1)$ versus n values of 1, 2, 3, 4, 20, 25, 30.
(c) Generalize your findings.

Solution:


Chapter 7:

Point Estimation

253

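(a) For each estimator, expected value, variance, bias, and mean squared error are determined as intermediate steps to the solution of relative efficiency.

For $T_1$:

$$E[T_1] = E[\bar{X}] = E\left[\frac{\sum_{i=1}^n X_i}{n}\right] = \frac{n}{n\lambda} = \frac{1}{\lambda}$$

$$Var[T_1] = Var[\bar{X}] = Var\left[\frac{\sum_{i=1}^n X_i}{n}\right] = \frac{n}{n^2\lambda^2} = \frac{1}{n\lambda^2}$$

$$Bias[T_1] = E[T_1] - \frac{1}{\lambda} = \frac{1}{\lambda} - \frac{1}{\lambda} = 0$$

$$MSE[T_1] = Var[T_1] + (Bias[T_1])^2 = \frac{1}{n\lambda^2} + 0 = \frac{1}{n\lambda^2}$$

For $T_2$:

$$E[T_2] = E\left[\frac{\sum_{i=1}^n X_i + 1}{n+2}\right] = \frac{\sum_{i=1}^n E[X_i] + 1}{n+2} = \frac{\frac{n}{\lambda} + 1}{n+2}$$

$$Var[T_2] = Var\left[\frac{\sum_{i=1}^n X_i + 1}{n+2}\right] = \frac{n\,Var[X]}{(n+2)^2} = \frac{n/\lambda^2}{(n+2)^2}$$

$$Bias[T_2] = E[T_2] - \frac{1}{\lambda} = \frac{\frac{n}{\lambda} + 1}{n+2} - \frac{1}{\lambda} = \frac{\lambda - 2}{\lambda(n+2)}$$

$$MSE[T_2] = Var[T_2] + (Bias[T_2])^2 = \frac{n/\lambda^2}{(n+2)^2} + \left(\frac{\lambda - 2}{\lambda(n+2)}\right)^2 = \frac{n + (\lambda - 2)^2}{\lambda^2(n+2)^2}$$

Consequently,

$$eff[T_2, T_1] = \frac{MSE[T_1]}{MSE[T_2]} = \frac{\frac{1}{n\lambda^2}}{\frac{n + (\lambda-2)^2}{\lambda^2(n+2)^2}} = \frac{(n+2)^2}{n(n + \lambda^2 - 4\lambda + 4)}$$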
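The closed-form result can be checked numerically. In the sketch below, the function names `mse_t1`, `mse_t2`, and `eff_t2_t1` are illustrative assumptions (the original solution defines its own `eff()` function, which is not reproduced here). The functions code the two mean squared errors and the simplified ratio so the algebra can be confirmed for any n and $\lambda$.

```r
# Mean squared errors derived above for the two estimators of 1/lambda
mse_t1 <- function(n, lambda) 1 / (n * lambda^2)
mse_t2 <- function(n, lambda) (n + (lambda - 2)^2) / (lambda^2 * (n + 2)^2)

# Simplified relative efficiency eff[T2, T1] = MSE[T1] / MSE[T2]
eff_t2_t1 <- function(n, lambda) {
  (n + 2)^2 / (n * (n + lambda^2 - 4 * lambda + 4))
}

n <- c(1, 2, 3, 4, 20, 25, 30)
# the simplified form and the ratio of MSEs should agree
cbind(n,
      simplified = eff_t2_t1(n, lambda = 1),
      ratio      = mse_t1(n, 1) / mse_t2(n, 1))
```

Values above 1 indicate that $T_2$ has the smaller mean squared error for that combination of n and $\lambda$.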

(b) The code to plot $eff(T_2, T_1)$ versus n values of 1, 2, 3, 4, 20, 25, and 30 is

$\pi^{x_i}(1 - \pi)^{120 - x_i}$.

$$\frac{-n \pm \sqrt{n^2 + 4n\sum_{i=1}^n x_i^2}}{2n}, \qquad \frac{-n + \sqrt{n^2 + 4n\sum_{i=1}^n X_i^2}}{2n}.$$

> set.seed(11)

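> set.seed(3)

$$\begin{aligned}
P(\min\{X_1, X_2, X_3\} \le x) &= 1 - P(X_1 > x, X_2 > x, X_3 > x) \\
&= 1 - \prod_{i=1}^3 P(X_i > x) \\
&= 1 - \prod_{i=1}^3 \left(1 - F(x)\right) \\
&= 1 - \prod_{i=1}^3 \left(1 - \left(1 - e^{-x/\theta}\right)\right) \\
&= 1 - \prod_{i=1}^3 e^{-x/\theta} = 1 - e^{-3x/\theta}
\end{aligned}$$

$\implies f(x) = \frac{3}{\theta} e^{-3x/\theta}$, which is an exponential density with parameter $\frac{3}{\theta}$.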

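(b) For an estimator to be considered efficient, its variance must equal the CRLB. That is,

$$\begin{aligned}
Var\left[\hat{\theta}(\mathbf{X})\right] &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial \ln f(X|\theta)}{\partial \theta}\right)^2\right]} \\
&= \frac{1}{n \cdot E\left[\left(\frac{\partial \ln\left(\frac{1}{\theta} e^{-X/\theta}\right)}{\partial \theta}\right)^2\right]} \\
&= \frac{1}{n \cdot E\left[\left(\frac{\partial}{\partial \theta}\left(-\ln(\theta) - \frac{X}{\theta}\right)\right)^2\right]} \\
&= \frac{1}{n \cdot E\left[\left(-\frac{1}{\theta} + \frac{X}{\theta^2}\right)^2\right]} \\
&= \frac{\theta^4}{n \cdot E\left[(X - \theta)^2\right]} \\
&= \frac{\theta^4}{n \cdot Var[X]} \\
&= \frac{\theta^4}{n \cdot \theta^2} = \frac{\theta^2}{n}
\end{aligned}$$

Since $Var\left[\hat{\theta}_3(\mathbf{X})\right] = \frac{\theta^2}{n}$, it is efficient.

(c) $\hat{\theta}_3(\mathbf{X})$ is the MLE because it is efficient.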


(d) If $X \sim Exp(\theta + 2)$, $E[X] = \theta + 2$. Each of $\hat{\theta}_1$, $\hat{\theta}_2$, and $\hat{\theta}_3$ then has an expected value of $\theta + 2$, so subtracting 2 from any of them yields an unbiased estimator of $\theta$.

20. Consider a random sample of size n from a population of size N, where the items in the population are sequentially numbered from 1 to N.

(a) Derive the method of moments estimator of N.
(b) Derive the maximum likelihood estimator of N.
(c) What are the method of moments and maximum likelihood estimates of N for this sample of size 7: {2, 5, 13, 6, 15, 9, 21}?

Solution:
(a) To derive the method of moments estimator, set $\alpha_1 = E[X^1] \overset{set}{=} \bar{X} = m_1$.

$$E[X] = \sum_{i=1}^N x_i \cdot \frac{1}{N} = \frac{N(N+1)}{2} \cdot \frac{1}{N} = \frac{N+1}{2}$$

$$\frac{N+1}{2} \overset{set}{=} \bar{X} \implies N + 1 = 2\bar{X} \implies N = 2\bar{X} - 1 \implies \tilde{N} = 2\bar{X} - 1$$

(b)
$$P(X = x) = \begin{cases} \frac{1}{N} & \text{if } x = 1, \ldots, N \\ 0 & \text{otherwise} \end{cases}$$

This means the likelihood function is

$$L(N|\mathbf{x}) = \begin{cases} \frac{1}{N^n} & N \ge \max\{x_1, \ldots, x_n\} \\ 0 & \text{otherwise} \end{cases}$$

Since this function decreases as N increases, the likelihood is maximized at the smallest value of N which produces a non-zero value, namely $\hat{N}(\mathbf{x}) = \max\{x_1, \ldots, x_n\}$.

(c)
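The two estimators derived in parts (a) and (b) can be applied to the sample in part (c) as in the sketch below. The variable names `N_mom` and `N_mle` are assumptions made for illustration and are not taken from the original solution.

```r
x <- c(2, 5, 13, 6, 15, 9, 21)

N_mom <- 2 * mean(x) - 1   # method of moments: 2 * xbar - 1
N_mle <- max(x)            # maximum likelihood: largest observed value

c(MoM = N_mom, MLE = N_mle)
```

For this sample the method of moments estimate (about 19.29) is smaller than the largest observed label, 21, so it is not a feasible value of N; the maximum likelihood estimate can never fall below the sample maximum.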

> mle
[1] 0.199561

For this sample, $\hat{\lambda}(\mathbf{x}) = 0.1996$.

> LBWT
low
  0   1 
130  59 
> prop.table(LBWT)
low
        0         1 
0.6878307 0.3121693 

(c) The maximum likelihood estimator of $\pi$ for a Bernoulli distribution is $\bar{X}$. Consequently, the MLE is $\bar{x} = 0.3122$.

(d) The maximum likelihood estimate of the proportion of children born with low birth weight at this particular hospital is approximately 31%.

23. In 1876, Charles Darwin had his book The Effect of Cross- and Self-Fertilization in the Vegetable Kingdom published. Darwin planted two seeds, one obtained by cross-fertilization and the other by auto-fertilization, in two opposite but separate locations of a pot. Self-fertilization, also called autogamy or selfing, is the fertilization of a plant with its own pollen. Cross-fertilization, or allogamy, is the fertilization with pollen of another plant, usually of the same species. Darwin recorded the plants' heights in inches. The data frame FERTILIZE from the PASWR2 package contains the data from this experiment.

Cross-fert: 23.5 12.0 21.0 22.0 19.1 21.5 22.1 20.4 18.3 21.6 23.3 21.0 22.1 23.0 12.0
Self-fert:  17.4 20.4 20.0 20.0 18.4 18.6 18.6 15.3 16.5 18.0 16.3 18.0 12.8 15.5 18.0

(a) Create a variable DD defined as the difference between the variables cross and self.
(b) Perform an exploratory analysis of DD to see if DD might follow a normal distribution.
(c) Use the function fitdistr() found in the MASS package to obtain the maximum likelihood estimates of $\mu$ and $\sigma$ if DD did follow a normal distribution.
(d) Verify that the results from (c) are the sample mean and the uncorrected sample standard deviation of DD.

Solution:
(a)

(b)
> shapiro.test(ThreeColumns$DD)

	Shapiro-Wilk normality test

data:  ThreeColumns$DD
W = 0.90079, p-value = 0.09785

[Figure: density estimate of DD (left) and normal quantile-quantile plot of DD, sample versus theoretical quantiles (right).]

Based on the density estimate, the quantile-quantile plot, and the Shapiro-Wilk normality test (using an $\alpha$ level of 0.05), one might assume normality for the distribution of DD.

(c)
> library(MASS)
> ans <- fitdistr(ThreeColumns$DD, "normal")
> ans
      mean          sd    
  2.6166667   4.5580667 
 (1.1768878) (0.8321853)
> MU <- ans$estimate[1]
> MU
    mean 
2.616667 
> SIGMA <- ans$estimate[2]
> SIGMA
      sd 
4.558067 

The maximum likelihood estimates of $\mu$ and $\sigma$ using the function fitdistr() assuming a normal distribution are 2.6167 and 4.5581, respectively.

(d) The following code verifies that the numbers 2.6167 and 4.5581 are the sample mean and the uncorrected sample standard deviation of DD.

> with(data = ThreeColumns,
+      c(mean(DD), sqrt(var(DD)*(length(DD) - 1)/length(DD)))
+ )
[1] 2.616667 4.558067
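Parts (c) and (d) of Problem 23 can also be reproduced without the PASWR2 package by typing in the heights printed in the problem statement. The sketch below does that; note the assumption that the printed heights are rounded to one decimal place, so the estimates may differ slightly from the FERTILIZE-based values 2.6167 and 4.5581. The relationship being checked, that the normal MLEs are the sample mean and the uncorrected standard deviation, holds for any data.

```r
library(MASS)  # provides fitdistr()

# heights as printed in the problem statement (rounded values)
cross <- c(23.5, 12.0, 21.0, 22.0, 19.1, 21.5, 22.1, 20.4,
           18.3, 21.6, 23.3, 21.0, 22.1, 23.0, 12.0)
self  <- c(17.4, 20.4, 20.0, 20.0, 18.4, 18.6, 18.6, 15.3,
           16.5, 18.0, 16.3, 18.0, 12.8, 15.5, 18.0)
DD <- cross - self
n  <- length(DD)

# uncorrected standard deviation: divide by n instead of n - 1
su <- sqrt(var(DD) * (n - 1) / n)

fit <- fitdistr(DD, "normal")
rbind(fitdistr = fit$estimate, by_hand = c(mean(DD), su))
```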

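24. The lognormal distribution has the following density function:

$$g(w) = \frac{1}{w\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln w - \mu)^2}{2\sigma^2}}, \quad w \ge 0, \quad -\infty < \mu < \infty, \quad \sigma > 0$$

where $\ln(W) \sim N(\mu, \sigma)$. The mean and variance of W are, respectively,

$$E[W] = e^{\mu + \frac{\sigma^2}{2}} \quad \text{and} \quad Var[W] = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).$$

Find the maximum likelihood estimators for E[W] and Var[W].

Solution: Given the MLEs for a normal distribution with mean $\mu$ and variance $\sigma^2$ are $\hat{\mu}(\mathbf{X}) = \bar{X}$ and $\hat{\sigma}^2(\mathbf{X}) = \frac{\sum_{i=1}^n (X_i - \bar{X})^2}{n} = S_u^2$, respectively, and the fact that MLEs are invariant, it follows that the MLEs for the mean and variance of W are

$$\widehat{E[W]} = e^{\bar{X} + S_u^2/2} \quad \text{and} \quad \widehat{Var[W]} = e^{2\bar{X} + S_u^2}\left(e^{S_u^2} - 1\right).$$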

25. Consider the variable brain from the Animals data frame in the MASS package.

(a) Estimate with maximum likelihood techniques the mean and variance of brain. Specifically, use the R function fitdistr() with a lognormal distribution.
(b) Suppose that brain is a lognormal variable; then the log of this variable is normal. To check this assertion, plot the cumulative distribution function of brain versus a lognormal cumulative distribution function. In another plot, represent the cumulative distribution function of log-brain versus a normal cumulative distribution function. Is it reasonable to assume that brain follows a lognormal distribution?
(c) Find the mean and standard deviation of brain assuming a lognormal distribution.
(d) Repeat this exercise without the dinosaurs. Comment on the changes in the mean and variance estimates.

Solution:
(a) The MLEs for the mean and variance of W are

$$\widehat{E[W]} = e^{\bar{X} + S_u^2/2} \quad \text{and} \quad \widehat{Var[W]} = e^{2\bar{X} + S_u^2}\left(e^{S_u^2} - 1\right).$$

> library(MASS)
> mle <- fitdistr(Animals$brain, "lognormal")
> mle
    meanlog       sdlog  
  4.4254457   2.3560485 
 (0.4452513) (0.3148402)
> EW <- exp(mle$estimate[1] + mle$estimate[2]^2/2)
> VW <- exp(2*mle$estimate[1] + mle$estimate[2]^2)*(exp(mle$estimate[2]^2) - 1)
> c(EW, VW)
     meanlog      meanlog 
1.340674e+03 4.610096e+08 

The maximum likelihood estimates of the mean and standard deviation for the logarithm of the variable brain are 4.4254 and 2.356, respectively. It follows using invariance properties of MLEs that the estimates of the mean and variance of brain are 1340.6744 g and 461009623.032 g², respectively (brain weight in Animals is recorded in grams).

(b)

> library(ggplot2)
> ggplot(data = Animals, aes(x = brain)) + stat_ecdf() +
+   stat_function(fun = plnorm,
+                 args = list(mle$estimate[1], mle$estimate[2])) +
+   theme_bw()
> #
> ggplot(data = Animals, aes(x = log(brain))) + stat_ecdf() +
+   stat_function(fun = pnorm,
+                 args = list(mle$estimate[1], mle$estimate[2])) +
+   theme_bw()

[Figure: empirical CDF of brain with the fitted lognormal CDF (left) and empirical CDF of log(brain) with the fitted normal CDF (right).]

It seems reasonable to assume brain follows a lognormal distribution based on the graphs.

(c) In agreement with part (a), the estimated mean and variance of the variable brain are 1340.6744 g and 461009623.032 g², respectively.

> ans
[1] 4.425446 5.756556
> c(xbar, V)
[1] 4.425446 5.756556

(d)
> ggplot(data = NoDinos, aes(x = brain)) + stat_ecdf() +
+   stat_function(fun = plnorm,
+                 args = list(mle$estimate[1], mle$estimate[2])) +
+   theme_bw()
> #
> ggplot(data = NoDinos, aes(x = log(brain))) + stat_ecdf() +
+   stat_function(fun = pnorm,
+                 args = list(mle$estimate[1], mle$estimate[2])) +
+   theme_bw()

[Figure: empirical CDF of brain with the fitted lognormal CDF (left) and empirical CDF of log(brain) with the fitted normal CDF (right), dinosaurs removed.]

It seems reasonable to assume brain still follows a lognormal distribution after removing the three dinosaurs based on the graphs. In agreement with R Code 7.3 on the preceding page, the estimated mean and variance of the variable brain are 1851.1266 g and 1668525179.2198 g², respectively.

> ans
[1] 4.428471 6.448081
> c(xbar, V)
[1] 4.428471 6.448081
> c(EW, VW)
[1] 1.851127e+03 1.668525e+09

After removing the three dinosaurs, the estimates of both the mean and the variance of the brain weight have increased.

26. The data in GD available in the PASWR2 package are the times until failure in hours for a particular electronic component subjected to an accelerated stress test.

(a) Find the method of moments estimates of α and λ if the data come from a Γ(α, λ) distribution.

(b) Create a density histogram of times until failure. Superimpose a gamma distribution using the estimates from part (a) over the density histogram.

(c) Find the maximum likelihood estimates of α and λ if the data come from a Γ(α, λ) distribution by using the function fitdistr() from the MASS package.

(d) Create a density histogram of times until failure. Superimpose a gamma distribution using the estimates from part (c) over the density histogram.

(e) Plot the cumulative distribution for time until failure. Superimpose the theoretical cumulative gamma distribution using both the method of moments and the maximum likelihood estimates of α and λ. Which estimates appear to model the data better?

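Solution:
(a) When $X \sim \Gamma(\alpha, \lambda)$, $E[X] = \frac{\alpha}{\lambda}$ and $Var[X] = \frac{\alpha}{\lambda^2}$. To find the MOM estimates, the system of equations to be solved is

$$\alpha_1(\alpha, \lambda) = E[X] = \frac{\alpha}{\lambda} \overset{set}{=} \frac{\sum_{i=1}^n X_i}{n} = m_1$$

$$\alpha_2(\alpha, \lambda) = E[X^2] = Var[X] + E[X]^2 = \frac{\alpha}{\lambda^2} + \left(\frac{\alpha}{\lambda}\right)^2 \overset{set}{=} \frac{\sum_{i=1}^n X_i^2}{n} = m_2$$

From the first equation, $\alpha = \frac{\lambda \sum_{i=1}^n X_i}{n}$. Substituting into the second equation to solve for $\lambda$ gives

$$\begin{aligned}
\frac{\alpha}{\lambda^2} + \frac{\alpha^2}{\lambda^2} &= \frac{\sum_{i=1}^n X_i^2}{n} \\
\alpha + \alpha^2 &= \frac{\lambda^2 \sum_{i=1}^n X_i^2}{n} \\
\frac{\lambda \sum_{i=1}^n X_i}{n} + \left(\frac{\lambda \sum_{i=1}^n X_i}{n}\right)^2 &= \frac{\lambda^2 \sum_{i=1}^n X_i^2}{n} \\
\frac{\sum_{i=1}^n X_i}{n} &= \lambda\left(\frac{\sum_{i=1}^n X_i^2}{n} - \left(\frac{\sum_{i=1}^n X_i}{n}\right)^2\right) \\
\bar{X} &= \lambda S_u^2 \\
\lambda &= \frac{\bar{X}}{S_u^2} \implies \tilde{\lambda} = \frac{\bar{X}}{S_u^2}
\end{aligned}$$

Solving for $\alpha$ now gives

$$\alpha = \frac{\lambda \sum_{i=1}^n X_i}{n} = \frac{\bar{X}}{S_u^2} \cdot \frac{\sum_{i=1}^n X_i}{n} = \frac{\bar{X}^2}{S_u^2} \implies \tilde{\alpha} = \frac{\bar{X}^2}{S_u^2}$$

> estimates
[1] 10.52530 11.68381
> Alpha <- estimates[1]^2/estimates[2]
> Lambda <- estimates[1]/estimates[2]
> c(Alpha, Lambda)
[1] 9.4816647 0.9008451

$$\tilde{\alpha} = \frac{\bar{X}^2}{S_u^2} = \frac{10.5253^2}{11.6838} = 9.4817 \qquad \tilde{\lambda} = \frac{\bar{X}}{S_u^2} = \frac{10.5253}{11.6838} = 0.9008$$

(b)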

> ggplot(data = GD, aes(x = attf)) +
+   geom_histogram(aes(x = attf, y = ..density..), binwidth = 1.5,
+                  fill = "blue", alpha = 0.3) +
+   stat_function(fun = dgamma, args = list(Alpha, Lambda),
+                 color = "blue") +
+   theme_bw()

[Figure: density histogram of attf with the gamma density based on the method of moments estimates superimposed.]

(c)
> library(MASS)
> mle <- fitdistr(GD$attf, "gamma")
> mle
     shape        rate   
  9.3961672   0.8918924 
 (1.3058899) (0.1273251)

The maximum likelihood estimates of $\alpha$ and $\lambda$ are 9.3962 and 0.8919, respectively.

(d)
> ggplot(data = GD, aes(x = attf)) +
+   geom_histogram(aes(x = attf, y = ..density..), binwidth = 1.5,
+                  fill = "red", alpha = 0.3) +
+   stat_function(fun = dgamma,
+                 args = list(mle$estimate[1], mle$estimate[2]),
+                 color = "red") +
+   theme_bw()

[Figure: density histogram of attf with the gamma density based on the maximum likelihood estimates superimposed.]

(e)

> ggplot(data = GD, aes(x = attf)) +
+   stat_ecdf() +
+   stat_function(fun = pgamma, args = list(Alpha, Lambda),
+                 color = "blue") +
+   stat_function(fun = pgamma,
+                 args = list(mle$estimate[1], mle$estimate[2]),
+                 color = "red") +
+   theme_bw()

[Figure: empirical cumulative distribution function of attf (x-axis: attf, y-axis: y) with the method of moments (blue) and maximum likelihood (red) gamma distribution functions overlaid.]

The cumulative distribution function based on the method of moments estimates of α and λ is visually indistinguishable from the one based on the maximum likelihood estimates of α and λ.

27. The time a client waits to be served by the mortgage specialist at a bank has density function

\[
f(x) = \frac{1}{2\theta^3}\, x^2 e^{-x/\theta}, \qquad x > 0,\ \theta > 0.
\]

(a) Derive the maximum likelihood estimator of θ for a random sample of size n.

(b) Show that the estimator derived in (a) is unbiased and efficient.

(c) Derive the method of moments estimator of θ.

(d) If the waiting times of 15 clients are 6, 12, 15, 14, 12, 10, 8, 9, 10, 9, 8, 7, 10, 7, and 3 minutes, compute the maximum likelihood estimate of θ.

Solution:


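(a)

\[
\begin{aligned}
L(\theta|\mathbf{x}) &= \prod_{i=1}^{n} \frac{1}{2\theta^3}\, x_i^2 e^{-x_i/\theta}
 = \frac{1}{2^n \theta^{3n}} \prod_{i=1}^{n} x_i^2 e^{-x_i/\theta}\\
\ln L(\theta|\mathbf{x}) &= -n \ln 2 - 3n \ln\theta + 2\sum_{i=1}^{n} \ln x_i - \frac{\sum_{i=1}^{n} x_i}{\theta}\\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= -\frac{3n}{\theta} + \frac{\sum_{i=1}^{n} x_i}{\theta^2} \overset{\text{set}}{=} 0\\
\implies -3n\theta &= -\sum_{i=1}^{n} x_i\\
\theta &= \frac{\sum_{i=1}^{n} x_i}{3n} \implies \hat{\theta}(\mathbf{X}) = \frac{\bar{X}}{3}
\end{aligned}
\]

To verify this is a maximum value, take the second partial and show it is less than zero.

\[
\begin{aligned}
\left.\frac{\partial^2 \ln L(\theta|\mathbf{x})}{\partial\theta^2}\right|_{\hat{\theta} = \bar{x}/3}
 &= \left.\left(\frac{3n}{\theta^2} - \frac{2\sum_{i=1}^{n} x_i}{\theta^3}\right)\right|_{\hat{\theta} = \bar{x}/3}
 = \frac{27n}{\bar{x}^2} - \frac{2n\bar{x}}{\bar{x}^3/27}\\
 &= \frac{27n}{\bar{x}^2} - \frac{54n}{\bar{x}^2} < 0 \text{ for all } n.
\end{aligned}
\]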

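(b) Show that $\hat{\theta}(\mathbf{X}) = \bar{X}/3$ is unbiased and efficient.

Unbiased:

\[
E[X] = \int_0^\infty x \cdot \frac{1}{2\theta^3}\, x^2 e^{-x/\theta}\, dx
\]

Let $u = x/\theta$, and note that $du = dx/\theta$.

\[
\begin{aligned}
E[X] &= \int_0^\infty \frac{u^3}{2}\, e^{-u}\, \theta\, du\\
 &= \frac{\theta}{2}\,\Gamma(4) = \frac{\theta}{2}\, 3!\\
E[X] &= 3\theta
\end{aligned}
\]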


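\[
E\left[\hat{\theta}(\mathbf{X})\right] = E\left[\frac{\bar{X}}{3}\right] = \frac{\sum_{i=1}^{n} E[X_i]}{3n} = \frac{n \cdot 3\theta}{3n} = \theta
\]

Therefore, $\hat{\theta}(\mathbf{X}) = \bar{X}/3$ is an unbiased estimator for θ. For later calculations, $E[X^2]$ is also needed.

\[
E[X^2] = \int_0^\infty x^2 \cdot \frac{1}{2\theta^3}\, x^2 e^{-x/\theta}\, dx
\]

Let $u = x/\theta$, and note that $du = dx/\theta$.

\[
\begin{aligned}
E[X^2] &= \theta \int_0^\infty \frac{u^4}{2}\, e^{-u}\, \theta\, du
 = \frac{\theta^2}{2}\,\Gamma(5) = \frac{\theta^2}{2}\, 4!\\
E[X^2] &= 12\theta^2\\
\operatorname{Var}[X] &= E[X^2] - (E[X])^2 = 12\theta^2 - (3\theta)^2 = 3\theta^2
\end{aligned}
\]

For an estimator to be considered efficient, its variance must equal the CRLB. That is,

\[
\begin{aligned}
\operatorname{Var}\left[\hat{\theta}(\mathbf{X})\right] = \operatorname{Var}\left[\frac{\bar{X}}{3}\right]
 &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2\right]}\\
\frac{1}{9} \cdot \operatorname{Var}\left[\frac{\sum_{i=1}^{n} X_i}{n}\right]
 &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial}{\partial\theta} \ln\left(\frac{1}{2\theta^3} X^2 e^{-X/\theta}\right)\right)^2\right]}\\
\frac{1}{9n^2} \cdot \operatorname{Var}\left[\sum_{i=1}^{n} X_i\right]
 &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial}{\partial\theta}\left(-\ln 2 - 3\ln\theta + 2\ln X - \frac{X}{\theta}\right)\right)^2\right]}\\
\frac{1}{9n^2} \cdot \sum_{i=1}^{n} \operatorname{Var}[X_i]
 &\overset{?}{=} \frac{1}{n \cdot E\left[\left(-\frac{3}{\theta} + \frac{X}{\theta^2}\right)^2\right]}
 = \frac{\theta^4}{n \cdot E\left[(X - 3\theta)^2\right]}\\
\frac{1}{9n^2} \cdot n \cdot 3\theta^2
 &\overset{?}{=} \frac{\theta^4}{n \cdot \operatorname{Var}[X]}\\
\frac{\theta^2}{3n} &= \frac{\theta^4}{n \cdot 3\theta^2} = \frac{\theta^2}{3n}
\end{aligned}
\]

So, $\hat{\theta}(\mathbf{X}) = \bar{X}/3$ is an efficient estimator of θ.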


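(c) To derive the method of moments estimator, solve

\[
\alpha_1 = E[X] = 3\theta \overset{\text{set}}{=} \bar{X} = m_1
\quad\implies\quad \theta = \frac{\bar{X}}{3}
\quad\implies\quad \tilde{\theta} = \frac{\bar{X}}{3}
\]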
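Since the maximum likelihood and method of moments estimators coincide, the estimate for the 15 waiting times listed in part (d) of the problem statement follows directly from the sample mean. The sketch below evaluates the closed form and, as a sanity check, also maximizes the log-likelihood from part (a) numerically with `optimize()`; the search interval `c(0.1, 20)` is an arbitrary assumption, not something taken from the original solution.

```r
# Waiting times (minutes) from the problem statement
waiting <- c(6, 12, 15, 14, 12, 10, 8, 9, 10, 9, 8, 7, 10, 7, 3)

# Closed-form MLE derived in part (a): theta-hat = xbar / 3
theta_hat <- mean(waiting) / 3

# Log-likelihood from part (a), maximized numerically as a check
loglike <- function(theta, x) {
  n <- length(x)
  -n * log(2) - 3 * n * log(theta) + 2 * sum(log(x)) - sum(x) / theta
}
opt <- optimize(loglike, interval = c(0.1, 20), x = waiting, maximum = TRUE)

c(closed_form = theta_hat, numerical = opt$maximum)
```

The sample mean is 140/15, so the closed form evaluates to 140/45, roughly 3.11 minutes.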

(d) > > > >

waiting + + > >

loglike 0.

(a) What distribution has this density function? Be sure to specify the parameter. (b) Find the maximum likelihood estimator of θ for random samples of size n. (c) Find the asymptotic variance of the maximum likelihood estimator. (d) Find the method of moments estimator of θ for a random sample of size n. (e) Calculate the maximum likelihood and method of moments estimates of θ for the sample {0.1, 0.7, 0.5, 0.85, 0.9}.

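Solution:

(a) It is a β(α = θ, β = 1, A = 0, B = 1) distribution.

\[
\begin{aligned}
f(x) &= \frac{1}{B - A} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \left(\frac{x - A}{B - A}\right)^{\alpha - 1} \left(\frac{B - x}{B - A}\right)^{\beta - 1}\\
f(x) &= \frac{1}{1 - 0} \cdot \frac{\Gamma(\theta + 1)}{\Gamma(\theta)\Gamma(1)} \cdot \left(\frac{x - 0}{1 - 0}\right)^{\theta - 1} \left(\frac{1 - x}{1 - 0}\right)^{1 - 1}\\
f(x) &= \frac{\theta\,\Gamma(\theta)}{\Gamma(\theta)\Gamma(1)} \cdot x^{\theta - 1} (1 - x)^0\\
f(x) &= \theta x^{\theta - 1}
\end{aligned}
\]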

(b) To find the MLE, set the partial of the log likelihood equal to zero, and solve for θ.

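\[
\begin{aligned}
L(\theta|\mathbf{x}) &= \theta^n \prod_{i=1}^{n} x_i^{\theta - 1}\\
\ln L(\theta|\mathbf{x}) &= n \ln(\theta) + (\theta - 1) \sum_{i=1}^{n} \ln(x_i)\\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= \frac{n}{\theta} + \sum_{i=1}^{n} \ln(x_i) \overset{\text{set}}{=} 0\\
\implies \theta &= \frac{-n}{\sum_{i=1}^{n} \ln(x_i)}
\implies \hat{\theta}(\mathbf{X}) = \frac{-n}{\sum_{i=1}^{n} \ln(X_i)}
\end{aligned}
\]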

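(c) The lower bound on the variance is the reciprocal of the information, $I_n(\theta)^{-1}$.

\[
\begin{aligned}
I_n(\theta) &= -E\left[\frac{\partial^2 \ln L(\theta|\mathbf{x})}{\partial\theta^2}\right]
 = -\left(\frac{-n}{\theta^2}\right) = \frac{n}{\theta^2}\\
\implies I_n(\theta)^{-1} &= \frac{\theta^2}{n}
\end{aligned}
\]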

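(d) Finding the MOM estimator: Recall that the mean of a beta distribution is

\[
A + (B - A)\,\frac{\alpha}{\alpha + \beta} = 0 + (1 - 0)\,\frac{\theta}{\theta + 1} = \frac{\theta}{\theta + 1} \tag{7.1}
\]

for this particular distribution.

\[
\begin{aligned}
\alpha_1 = E[X] &\overset{\text{set}}{=} \bar{X} = m_1\\
\frac{\theta}{\theta + 1} &= \bar{X}\\
\theta &= \theta\bar{X} + \bar{X}\\
\theta(1 - \bar{X}) &= \bar{X}\\
\theta &= \frac{\bar{X}}{1 - \bar{X}} \implies \tilde{\theta} = \frac{\bar{X}}{1 - \bar{X}}
\end{aligned}
\]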
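The estimators from parts (b) and (d), together with the variance bound from part (c), can be evaluated for the sample given in part (e) of the problem. The following is a minimal sketch using only base R; the variable names are illustrative and are not taken from the original solution.

```r
# Sample from part (e) of the problem statement
x <- c(0.1, 0.7, 0.5, 0.85, 0.9)
n <- length(x)

theta_mle <- -n / sum(log(x))          # MLE: -n / sum(ln x_i)
theta_mom <- mean(x) / (1 - mean(x))   # MOM: xbar / (1 - xbar)

# Plug-in estimate of the asymptotic variance theta^2 / n from part (c)
asy_var <- theta_mle^2 / n

c(mle = theta_mle, mom = theta_mom, asy_var = asy_var)
```

For this sample the sample mean is 0.61, so the method of moments estimate is 0.61/0.39 (about 1.56), while the maximum likelihood estimate is about 1.38.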

(e) Calculations: > > > >

x > >

> ggplot(data = data.frame(x = x), aes(x = x)) +
+   geom_histogram(aes(x = x, y = ..density..), fill = "red",
+                  alpha = 0.5) +
+   theme_bw() +
+   stat_function(fun = f)

[Figure: density histogram of the generated sample (x-axis: x, y-axis: density) with the density f overlaid.]

(d) To calculate the MLE of θ, take the partial of the log likelihood, set it equal to zero, and solve for θ.

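\[
\begin{aligned}
L(\theta|\mathbf{x}) &= \prod_{i=1}^{n} 3\pi\theta x_i^2 e^{-\theta\pi x_i^3}\\
L(\theta|\mathbf{x}) &= (3\pi\theta)^n e^{-\theta\pi \sum_{i=1}^{n} x_i^3} \prod_{i=1}^{n} x_i^2\\
\ln L(\theta|\mathbf{x}) &= n\ln(3\pi) + n\ln(\theta) - \theta\pi \sum_{i=1}^{n} x_i^3 + \sum_{i=1}^{n} 2\ln(x_i)\\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= \frac{n}{\theta} - \pi \sum_{i=1}^{n} x_i^3 \overset{\text{set}}{=} 0\\
n &= \theta\pi \sum_{i=1}^{n} x_i^3\\
\implies \theta &= \frac{n}{\pi \sum_{i=1}^{n} x_i^3}
\implies \hat{\theta}(\mathbf{X}) = \frac{n}{\pi \sum_{i=1}^{n} X_i^3}
\end{aligned}
\]

For the generated sample,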

> mle
[1] 5.038159

ˆ θ(X) = (e)

π

n n

3 i=1 Xi

=

500 = 5.0382. (3.1416)(1263.5961)

> loglike > > > + + + >

> set.seed(8675)
> library(MASS)
> fitdistr(x = x, densfun = "beta", start = list(shape1 = 1, shape2 = 1))
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
Warning in densfun(x, parm[1], parm[2], ...): NaNs produced
     shape1       shape2
  3.0810430   2.1497909
 (0.2422334) (0.1633135)

38. Consider a random sample of size n from an exponential distribution with pdf

\[
f(x) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x \ge 0,\ \theta > 0. \tag{7.2}
\]

(a) Find the MLE of θ.

(b) Given the answer in part (a), what is the MLE of θ²?

Solution:


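(a) To find the MLE of θ, set the partial of the log-likelihood function equal to zero and solve for θ.

\[
\begin{aligned}
f(x) &= \frac{1}{\theta}\, e^{-x/\theta}\\
L(\theta|\mathbf{x}) &= \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-x_i/\theta}
 = \frac{1}{\theta^n}\, e^{-\sum_{i=1}^{n} x_i/\theta}\\
\ln L(\theta|\mathbf{x}) &= -n\ln(\theta) - \frac{\sum_{i=1}^{n} x_i}{\theta}\\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= -\frac{n}{\theta} + \frac{\sum_{i=1}^{n} x_i}{\theta^2} \overset{\text{set}}{=} 0\\
-n\theta &= -\sum_{i=1}^{n} x_i\\
\theta &= \frac{\sum_{i=1}^{n} x_i}{n}
\implies \hat{\theta}(\mathbf{X}) = \frac{\sum_{i=1}^{n} X_i}{n} = \bar{X}
\end{aligned}
\]

(b) Since $\bar{X}$ is the MLE of θ, using the invariance property of MLEs, the MLE of θ² is $\bar{X}^2$.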
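The result can be checked numerically. The sketch below uses a small hypothetical data vector (the exercise supplies no data), maximizes the exponential log-likelihood with `optimize()` over an arbitrarily chosen interval, and compares the maximizer with the sample mean; squaring the estimate illustrates the invariance property used in part (b).

```r
# Hypothetical data, for illustration only (not from the exercise)
x <- c(2.1, 0.4, 3.7, 1.2, 0.9, 5.3, 2.8)

# Log-likelihood of the exponential model with mean theta
loglike <- function(theta, x) -length(x) * log(theta) - sum(x) / theta

fit <- optimize(loglike, interval = c(0.01, 50), x = x, maximum = TRUE)

theta_hat    <- mean(x)        # closed-form MLE of theta
theta_sq_hat <- theta_hat^2    # MLE of theta^2 by invariance

c(numerical = fit$maximum, closed_form = theta_hat, mle_theta_sq = theta_sq_hat)
```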


Chapter 8 Confidence Intervals

1. Is $[\bar{x} - 3, \bar{x} + 3]$ a confidence interval for the population mean of a normal distribution? Why or why not?

Solution: The interval $[\bar{x} - 3, \bar{x} + 3]$ is a confidence interval since it takes the form of a point estimate minus and plus the margin of error.

2. Explain how to construct a 95% confidence interval for the population mean of a normal distribution if σ is known.

Solution: Using the confidence interval for µ when sampling from a normal distribution with known variance, one would compute the sample mean and substitute that value for $\bar{x}$, as well as 1.96 for $z_{1-\alpha/2}$, into the equation along with the values for σ and n.

3. Given a random sample {X1, X2, ..., Xn} from a normal population N(µ, σ), where σ is known:

(a) What is the confidence level for the interval $\bar{x} \pm 2.053749\,\sigma/\sqrt{n}$?

(b) What is the confidence level for the interval $\bar{x} \pm 1.405072\,\sigma/\sqrt{n}$?

(c) What is the value of the percentile $z_{\alpha/2}$ for a 99% confidence interval?

Solution:

(a) The confidence level is 0.96.

> 1 - pnorm(-2.053749)*2
[1] 0.96

(b) The confidence level is 0.84.

> 1 - pnorm(-1.405072)*2
[1] 0.8400001

(c) To answer the problem, one must find the value of $z_{0.005} = -2.5758$.

> qnorm(0.005)
[1] -2.575829

4. Given a random sample {X1, X2, ..., Xn} from a normal population N(µ, σ), where σ is known, consider the confidence interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ for µ.

(a) Given a fixed sample size n, explain the relationship between the confidence level and the precision of the confidence interval.

(b) Given a confidence level (1 − α)%, explain how the precision of the confidence interval changes with the sample size.

Solution:

(a) As the confidence level for the confidence interval increases, so does the value $z_{1-\alpha/2}$, which subsequently increases the width of the confidence interval, thereby decreasing the precision of the confidence interval. In a similar fashion, as the confidence level decreases, the confidence interval width decreases, increasing the precision of the confidence interval.

(b) With a fixed confidence level, as the size of the sample (n) increases, the margin of error decreases, producing a more precise confidence interval. Likewise, as the sample size (n) decreases, the margin of error increases, resulting in a less precise confidence interval.

5. Given a normal population with known variance σ², by what factor must the sample size be increased to reduce the length of a confidence interval for the mean by a factor of k?

Solution: The length of a confidence interval for the mean given a normal population with known variance σ² is $2 \cdot z_{1-\alpha/2}\,\sigma/\sqrt{n}$. Reducing the length by a factor of k implies that the ratio of the original length to the new length will equal k.

\[
\begin{aligned}
\frac{2 \cdot z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n_{\text{original}}}}}{2 \cdot z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n_{\text{new}}}}} &= k\\
\frac{\sqrt{n_{\text{new}}}}{\sqrt{n_{\text{original}}}} &= k\\
\implies n_{\text{new}} &= k^2 \cdot n_{\text{original}}
\end{aligned}
\]
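The relation between the two sample sizes can be checked numerically. In the sketch below the helper `ci_length()` and the values `sigma = 1`, `k = 3`, and `n_original = 25` are hypothetical choices made only for illustration.

```r
# Length of the known-sigma confidence interval for the mean
ci_length <- function(n, sigma = 1, conf.level = 0.95) {
  2 * qnorm(1 - (1 - conf.level) / 2) * sigma / sqrt(n)
}

k          <- 3
n_original <- 25                  # hypothetical starting sample size
n_new      <- k^2 * n_original    # sample size suggested by the derivation

# Ratio of original length to new length; this should equal k
ratio <- ci_length(n_original) / ci_length(n_new)
ratio
```

The ratio does not depend on σ or on the confidence level, since both cancel.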

Therefore, to reduce the length of a confidence interval by a factor of k requires a sample size of $k^2 n_{\text{original}}$. For example, to reduce the length of a confidence interval by a factor of 3 requires a sample size of 9n.

6. A historic data set studied by R.A. Fisher is the measurements in centimeters of four flower parts (sepal length, sepal width, petal length, and petal width) on 50 specimens for each of three species of irises (Setosa, Versicolor, and Virginica). The data are stored in the data frame iris (Fisher, 1936).

(a) Analyze the sepal lengths for Setosa, Versicolor, and Virginica irises, and comment on the characteristics of their distributions.

(b) Based on the analysis from part (a), construct an appropriate 99% confidence interval for the mean sepal length of Setosa irises.

Solution:

(a)

ggplot(data = iris, aes(x = Sepal.Length)) +
  geom_density(fill = "yellow", alpha = 0.5) +
  facet_grid(Species ~ .) +
  theme_bw()
ggplot(data = iris, aes(sample = Sepal.Length)) +
  stat_qq() +
  facet_grid(Species ~ .) +
  theme_bw()

[Figure: density plots of Sepal.Length (left) and normal quantile-quantile plots (right; x-axis: theoretical), faceted by species.]

The sepal lengths for Setosa irises are symmetric and unimodal, centered at 5.006 cm with a standard deviation of 0.3525 cm. The sepal lengths for Versicolor irises are symmetric and unimodal, centered at 5.936 cm with a standard deviation of 0.5162 cm. The sepal lengths for Virginica irises are symmetric and unimodal, centered at 6.588 cm with a standard deviation of 0.6359 cm.

(b)

> CI
[1] 4.872406 5.139594
attr(,"conf.level")
[1] 0.99

A 99% confidence interval is [4.8724, 5.1396] for the mean sepal length of Setosa irises.
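One way to obtain an interval of this kind is a one-sample t interval on the Setosa sepal lengths. The call below is a sketch of that approach using the built-in `iris` data frame; it is an assumption about how the interval was computed, although it does agree with the limits reported above.

```r
# Sepal lengths of the 50 Setosa specimens in the built-in iris data
setosa <- iris$Sepal.Length[iris$Species == "setosa"]

# 99% t-based confidence interval for the mean sepal length
CI <- t.test(setosa, conf.level = 0.99)$conf.int
CI
```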

7. Surface-water salinity measurements were taken in a bottom-sampling project in Whitewater Bay, Florida. These data are stored in the data frame SALINITY in the PASWR2 package. Geographic considerations lead geologists to believe that the salinity variation should be normally distributed. If this is true, it means there is free mixing and interchange between open marine water and fresh water entering the bay (Davis, 1986).

(a) Construct a quantile-quantile plot of the data. Does this plot rule out normality?

(b) Construct a 95% confidence interval for the mean salinity variation.

Solution:

(a) The quantile-quantile plot does not rule out normality.

> ggplot(data = SALINITY, aes(sample = salinity)) +
+   stat_qq() +
+   theme_bw()

[Figure: normal quantile-quantile plot of salinity (x-axis: theoretical, y-axis: sample).]

(b)

> CI
[1] 46.85025 52.23308
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the mean salinity variation is [46.8502, 52.2331].

8. The survival times in weeks for 20 male rats that were exposed to a high level of radiation are

152 152 115 109 137  88  94  77 160 165
125  40 128 123 136 101  62 153  83  69

Data are from Lawless (1982) and are stored in the data frame RAT.

(a) Construct a quantile-quantile plot of the survival times. Based on the quantile-quantile plot, can normality be ruled out?

(b) Construct a 92% confidence interval for the average survival time for male rats exposed to high levels of radiation.

Solution:

(a) The quantile-quantile plot does not rule out normality.

> ggplot(data = RAT, aes(sample = survival.time)) +
+   stat_qq() +
+   theme_bw()

[Figure: normal quantile-quantile plot of survival.time (x-axis: theoretical, y-axis: sample).]
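The interval requested in part (b) can be computed directly from the 20 survival times listed in the problem statement. The sketch below types the values in by hand rather than loading the RAT data frame, and assumes a one-sample t interval is the intended method.

```r
# Survival times (weeks) from the problem statement
survival_time <- c(152, 152, 115, 109, 137, 88, 94, 77, 160, 165,
                   125, 40, 128, 123, 136, 101, 62, 153, 83, 69)

# 92% t-based confidence interval for the mean survival time
CI <- t.test(survival_time, conf.level = 0.92)$conf.int
CI
```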

(b)

> CI
[1]  98.6486 128.2514
attr(,"conf.level")
[1] 0.92

A 92% confidence interval for the mean survival time of male rats exposed to high levels of radiation is [98.6486, 128.2514].

9. A large company wants to estimate the proportion of its accounts that are paid on time.

(a) How large a sample is needed to estimate the true proportion within 3% with a 96% confidence level?

(b) Suppose 650 out of 800 accounts are paid on time. Construct 95% confidence intervals for the true proportion of accounts that are paid on time using an asymptotic confidence interval, a score confidence interval, an Agresti-Coull confidence interval, and the Clopper-Pearson confidence interval.

Solution:

> binom.confint(x = 650, n = 800, conf.level = 0.95, methods = "asymptotic")
      method   x   n   mean     lower     upper
1 asymptotic 650 800 0.8125 0.7854532 0.8395468
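The four intervals requested in part (b) can also be sketched with base R alone. The code below is an illustration rather than the original solution: the asymptotic (Wald) and Agresti-Coull intervals are computed from their formulas, while `prop.test()` without continuity correction is used for the score interval and `binom.test()` for the Clopper-Pearson interval.

```r
x <- 650; n <- 800; conf.level <- 0.95
z <- qnorm(1 - (1 - conf.level) / 2)
p_hat <- x / n

# Asymptotic (Wald) interval
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Agresti-Coull interval: adjusted sample size and adjusted proportion
ntilde <- n + z^2
ptilde <- (x + z^2 / 2) / ntilde
agresti_coull <- ptilde + c(-1, 1) * z * sqrt(ptilde * (1 - ptilde) / ntilde)

# Score (Wilson) interval and Clopper-Pearson (exact) interval
score           <- prop.test(x, n, conf.level = conf.level, correct = FALSE)$conf.int
clopper_pearson <- binom.test(x, n, conf.level = conf.level)$conf.int

rbind(wald, score = as.numeric(score), agresti_coull,
      clopper_pearson = as.numeric(clopper_pearson))
```

The Wald limits agree with the `binom.confint()` output shown above.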

> CI
[1] -1.054064  1.659598
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the true average difference between females and males is [−1.0541, 1.6596]. Note that this interval contains 0, indicating that there is not enough evidence to suggest there are actual gender differences with respect to temperature.

(b)

> CI
[1] -2.496501 -0.256440
attr(,"conf.level")
[1] 0.95

A 95% confidence interval for the true average difference between students taking their temperatures at 8 a.m. and students taking their temperatures at 9 a.m. is [−2.4965, −0.2564]. Note that this interval does not contain 0, indicating that there is evidence to suggest students in the 8 a.m. class have temperatures that are not as warm as those of the 9 a.m. class. One possible explanation is that students roll straight out of bed and into the 8 a.m. class. Consequently, their temperatures are closer to their sleeping temperatures, which are lower than their waking temperatures.

11. The Cosmed K4b² is a portable metabolic system. A study at Appalachian State University compared the metabolic values obtained from the Cosmed K4b² to those of a reference unit (Amatek) over a range of workloads from easy to maximal to test the validity and reliability of the Cosmed K4b². A small portion of the results for VO2 (ml/kg/min) measurements taken at a 150 watt workload are stored in data frame COSAMA and in the following table:

Subject  Cosmed  Amatek     Subject  Cosmed  Amatek
   1     31.71   31.20         8     30.33   27.95
   2     33.96   29.15         9     30.78   29.08
   3     30.03   27.88        10     30.78   28.74
   4     24.42   22.79        11     31.84   28.75
   5     29.07   27.00        12     22.80   20.20
   6     28.42   28.09        13     28.99   29.25
   7     31.90   32.66        14     30.80   29.13

(a) Construct a quantile-quantile plot for the between-system differences.

(b) Are the VO2 values reported for Cosmed and Amatek independent?

(c) Construct a 95% confidence interval for the average VO2 system difference.

Solution:

(a) Based on the quantile-quantile plot of the differences between the Cosmed and Amatek VO2 values, there is no reason to rule out normality.

> ggplot(data = COSAMA, aes(sample = DIFF)) +
+   stat_qq(color = "red", size = 3) +
+   theme_bw()

[Figure: normal quantile-quantile plot of DIFF (x-axis: theoretical, y-axis: sample).]

(b) The reported Cosmed and Amatek values are dependent as each subject in the study has both a Cosmed and an Amatek VO2 score. (c)


> CI
[1] 0.8746899 2.5353101
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the average VO2 difference is [0.8747, 2.5353].

12. Let {X1, ..., X19} and {Y1, ..., Y15} be two random samples from a N(µX, σ) and a N(µY, σ), respectively. Suppose that $\bar{x} = 57.3$, $s_X^2 = 8.3$, $\bar{y} = 65.6$, and $s_Y^2 = 9.7$. Find a 96% confidence interval for µX, µY, and µX − µY.

Solution:
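The three intervals can be sketched from the summary statistics alone. The code below assumes one-sample t intervals for µX and µY and, because both populations share the same σ, a pooled-variance t interval for µX − µY; the variable names are illustrative.

```r
# Summary statistics from the problem statement
nx <- 19; xbar <- 57.3; s2x <- 8.3
ny <- 15; ybar <- 65.6; s2y <- 9.7
conf.level <- 0.96
alpha <- 1 - conf.level

# One-sample t intervals for mu_X and mu_Y
ci_x <- xbar + c(-1, 1) * qt(1 - alpha / 2, df = nx - 1) * sqrt(s2x / nx)
ci_y <- ybar + c(-1, 1) * qt(1 - alpha / 2, df = ny - 1) * sqrt(s2y / ny)

# Pooled variance (common sigma) and interval for mu_X - mu_Y
sp2     <- ((nx - 1) * s2x + (ny - 1) * s2y) / (nx + ny - 2)
ci_diff <- (xbar - ybar) +
  c(-1, 1) * qt(1 - alpha / 2, df = nx + ny - 2) * sqrt(sp2 * (1 / nx + 1 / ny))

list(mu_x = ci_x, mu_y = ci_y, mu_x_minus_mu_y = ci_diff)
```

With these inputs the pooled variance is (18 · 8.3 + 14 · 9.7)/32 = 8.9125.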

mean.x

xbar1 > > > > >

N > >

N > > > >

> CI
[1] 0.07262507 0.11694046
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the proportion of schizophreniform patients admitted to Virgen del Camino is [0.0726, 0.1169].

> CI
[1] 0.04094285 0.07631626
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the proportion of schizoaffective patients admitted to Virgen del Camino is [0.0409, 0.0763].

> CI
[1] 0.07667093 0.12193291
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the proportion of bipolar patients admitted to Virgen del Camino is [0.0767, 0.1219].

> CI
[1] 0.02455615 0.05353698
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the proportion of delusional patients admitted to Virgen del Camino is [0.0246, 0.0535].

> CI
[1] 0.06324816 0.10522800
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the proportion of psychotic patients admitted to Virgen del Camino is [0.0632, 0.1052].

> CI
[1] 0.03455100 0.06764427
attr(,"conf.level")
[1] 0.95

The 95% confidence interval for the proportion of atypical psychosis patients admitted to Virgen del Camino is [0.0346, 0.0676].

36. Find the required sample size (n) to estimate the proportion of students spending more than €10 a week on entertainment with a 95% confidence interval so that the margin of error is no more than 0.02.

Solution:

\[
n = p(1 - p)\left(\frac{z_{1-\alpha/2}}{B}\right)^2
 = 0.5(1 - 0.5)\left(\frac{z_{0.975}}{0.02}\right)^2
 = 0.25\left(\frac{1.96}{0.02}\right)^2 = 2400.9118
\]

> c(n, ceiling(n))
[1] 2400.912 2401.000
> # or
> nsize(b = 0.02, p = 0.5, conf.level = 0.95, type = "pi")

The required sample size (n) to estimate the population proportion of successes with a 0.95 confidence interval so that the margin of error is no more than 0.02 is 2401.


One must sample at least 2401 students to be 95% confident the margin of error in estimating the true proportion of students spending more than €10 a week on entertainment is no more than 0.02.
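The sample size formula can be wrapped in a small helper. The function name `prop_sample_size()` and its arguments are hypothetical, introduced here only to illustrate the calculation; the conservative value p = 0.5 is the default.

```r
# Sample size so that the margin of error for a proportion is at most B
prop_sample_size <- function(B, p = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p * (1 - p) * (z / B)^2
}

n <- prop_sample_size(B = 0.02)
c(n, ceiling(n))
```

The unrounded value differs slightly from 0.25 · (1.96/0.02)² = 2401 because `qnorm(0.975)` is 1.959964 rather than exactly 1.96. Called with `B = 0.03` and `conf.level = 0.96`, the same helper rounds up to 1172.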


Chapter 9 Hypothesis Testing

1. Define α and β for a test of hypothesis. What is the quantity 1 − β called?

Solution: The probability of making a type I error (rejecting the null hypothesis when it is true) is α, while β is the probability of making a type II error (failing to reject the null hypothesis when it is false). The probability of rejecting the null hypothesis when it is false is 1 − β, which is known as the power of the test.

2. How can β be made small in a given hypothesis test with fixed α?

Solution: Increase the sample size.

3. Using a 5% significance level, what is the power of the test H0 : µ = 100 versus H1 : µ ≠ 100 if a sample of size 36 is taken from a N(120, 50)?

Solution:
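The power asked for in Problem 3 can be sketched as follows, assuming a two-sided z-test with known σ and reading N(120, 50) in the book's N(µ, σ) notation, so that σ = 50. The variable names are illustrative.

```r
alpha <- 0.05
mu0   <- 100   # mean under H0
mu1   <- 120   # true mean
sigma <- 50    # known standard deviation (N(mu, sigma) notation)
n     <- 36

se   <- sigma / sqrt(n)
crit <- mu0 + c(-1, 1) * qnorm(1 - alpha / 2) * se   # rejection limits for xbar

# Power: probability that xbar falls in the rejection region when mu = mu1
power <- pnorm(crit[1], mean = mu1, sd = se) +
  pnorm(crit[2], mean = mu1, sd = se, lower.tail = FALSE)
power
```

Under these assumptions the power is roughly 0.67, almost all of it coming from the upper rejection region.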

> with(data = Greater90,
+      eda(totalprice)
+ )

Step 1: Hypotheses — The hypotheses are $H_0: \sigma^2 = 60{,}000^2$ versus $H_1: \sigma^2 > 60{,}000^2$.

Step 2: Test Statistic — The test statistic chosen is $S^2$ because $E[S^2] = \sigma^2$.

> TS
[1] 3822980710

The value of this test statistic is $s^2 = 3822980710.3638$. The standardized test statistic under the assumption that H0 is true and its distribution are $\dfrac{(n-1)S^2}{\sigma_0^2} \sim \chi^2_{n-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $\chi^2_{93}$ and H1 is an upper one-sided hypothesis, the rejection region is $\chi^2_{obs} > \chi^2_{0.95;\,94-1} = 116.511$. The value of the standardized test statistic is $\chi^2_{obs} = 98.7603$.

> RR
[1] 116.511
> STS
[1] 98.76034

Step 4: Statistical Conclusion — The ℘-value is $P(\chi^2_{93} \ge 98.7603) = 0.3218$.

> pvalue
[1] 0.3218218

I. From the rejection region, fail to reject H0 because $\chi^2_{obs} = 98.7603$ is less than 116.511.

II. From the ℘-value, fail to reject H0 because the ℘-value 0.3218 is greater than 0.05.

Fail to reject H0.

Step 5: English Conclusion — There is insufficient evidence to suggest the variance for the appraised price of 90 m² or larger pisos is greater than 60,000² €².
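The numbers in Steps 2 through 4 can be reproduced from the reported sample variance alone. The sketch below takes n = 94 and s² = 3822980710.3638 from the solution text and does not reload the underlying data.

```r
n      <- 94
s2     <- 3822980710.3638   # sample variance reported in Step 2
sigma0 <- 60000             # hypothesized standard deviation
alpha  <- 0.05

STS    <- (n - 1) * s2 / sigma0^2                        # standardized test statistic
RR     <- qchisq(1 - alpha, df = n - 1)                  # upper-tail critical value
pvalue <- pchisq(STS, df = n - 1, lower.tail = FALSE)    # upper-tail p-value

c(STS = STS, RR = RR, pvalue = pvalue)
```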

6. The Hubble Space Telescope was put into orbit on April 25, 1990. Unfortunately, on June 25, 1990, a spherical aberration was discovered in Hubble's primary mirror. To correct this, astronauts had to work in space. To prepare for the mission, two teams of astronauts practiced making repairs under simulated space conditions. Each team of astronauts went through 15 identical scenarios. The times to complete each scenario were recorded in days. Is one team better than the other? If not, can both teams complete the mission in less than 3 days? Use a 5% significance level for all tests. The data are stored in the data frame HUBBLE.

Solution: Note that each team of astronauts went through 15 identical scenarios. Consequently, the repair times for the two teams are dependent. Start the analysis by verifying the normality assumption required to use a paired t-test.

> eda(Diff)

[Figure: exploratory data analysis of Diff (histogram, density, boxplot, and Q-Q plot).]

The results from applying the function eda() to the differences between team1 and team2 suggest it is not unreasonable to assume the repair time differences between team1 and team2 follow a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if the average repair times for team1 and team2 are different, the hypotheses are H0 : µD = 0 versus H1 : µD ≠ 0.

Step 2: Test Statistic — The test statistic chosen is $\bar{D}$ because $E[\bar{D}] = \mu_D$.

> dbar
[1] -0.1

The value of this test statistic is $\bar{d} = -0.1$. The standardized test statistic under the assumption that H0 is true and its distribution are $\dfrac{\bar{D} - \delta_0}{S_D/\sqrt{n_D}} \sim t_{15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{14}$ and H1 is a two-sided hypothesis, the rejection region is $|t_{obs}| > t_{0.975;\,14} = 2.1448$.

> RR
[1] 2.144787
> TR

	One Sample t-test

data:  Diff
t = -0.25836, df = 14, p-value = 0.7999
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -0.9301447  0.7301447
sample estimates:
mean of x
     -0.1

The value of the standardized test statistic is $t_{obs} = \dfrac{\bar{d} - \delta_0}{s_D/\sqrt{n_D}} = -0.2584$.

Step 4: Statistical Conclusion — The ℘-value is $2 \times P(t_{14} \ge |-0.2584|) = 0.7999$.

I. From the rejection region, fail to reject H0 because |tobs | = 0.2584 is less than 2.1448. II. From the ℘-value, fail to reject H0 because the ℘-value 0.7999 is greater than 0.05.

Fail to reject H0 .

Step 5: English Conclusion — There is not sufficient evidence to suggest the mean difference in repair times is not equal to zero. In other words, there is no evidence to suggest one team is better than the other.
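The two-sided ℘-value and the critical value for this paired test can be checked from the reported test statistic. The sketch below uses t = −0.25836 and 14 degrees of freedom, as shown in the t-test output, rather than the raw HUBBLE data.

```r
df     <- 14          # n_D - 1 = 15 - 1
t_diff <- -0.25836    # test statistic reported by t.test()

# Two-sided p-value: twice the upper-tail area beyond |t|
p_two_sided <- 2 * pt(abs(t_diff), df = df, lower.tail = FALSE)

# Two-sided critical value at the 5% level and the rejection decision
crit   <- qt(0.975, df = df)
reject <- abs(t_diff) > crit

c(p_value = p_two_sided, critical_value = crit, reject = reject)
```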

To answer whether both teams can complete the mission in less than three days, start by verifying the normality assumption of the data for team1 using exploratory data analysis (eda()).

> with(data = HUBBLE, + eda(team1) + )

[Figure: exploratory data analysis of team1 (histogram, density, boxplot, and Q-Q plot).]

The results from applying the function eda() to the repair times for team1 suggest it is not unreasonable to assume the repair times for team1 follow a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test whether the mean team1 repair time is less than 3 days, the hypotheses are H0 : µ = 3 versus H1 : µ < 3.

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 2.22

The value of this test statistic is $\bar{x} = \sum_{i=1}^{n} x_i / n = 2.22$. The standardized test statistic under the assumption that H0 is true and its distribution are $\dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{14}$ and H1 is a lower one-sided hypothesis, the rejection region is $t_{obs} < t_{0.05;\,14} = -1.7613$.

> RR
[1] -1.76131
> TR

	One Sample t-test

data:  team1
t = -3.7753, df = 14, p-value = 0.001024
alternative hypothesis: true mean is less than 3
95 percent confidence interval:
     -Inf 2.583896
sample estimates:
mean of x
     2.22

The value of the standardized test statistic is $t_{obs} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = -3.7753$.

Step 4: Statistical Conclusion — The ℘-value is $P(t_{14} \le -3.7753) = 0.001$.

I. From the rejection region, reject H0 because $t_{obs} = -3.7753$ is less than −1.7613.

II. From the ℘-value, reject H0 because the ℘-value 0.001 is less than 0.05.

Reject H0.

Step 5: English Conclusion — There is evidence that the team1 average mission repair time is less than 3 days.

For team2, start by verifying the normality assumption of the data using exploratory data analysis (eda()).

> with(data = HUBBLE, + eda(team2) + )

[Figure: exploratory data analysis of team2 (histogram, density, boxplot, and Q-Q plot).]

The results from applying the function eda() to the repair times for team2 suggest it is not unreasonable to assume the repair times for team2 follow a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test whether the mean team2 repair time is less than 3 days, the hypotheses are H0 : µ = 3 versus H1 : µ < 3.

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 2.32

The value of this test statistic is $\bar{x} = \sum_{i=1}^{n} x_i / n = 2.32$. The standardized test statistic under the assumption that H0 is true and its distribution are $\dfrac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{14}$ and H1 is a lower one-sided hypothesis, the rejection region is $t_{obs} < t_{0.05;\,14} = -1.7613$.

> RR
[1] -1.76131
> TR

	One Sample t-test

data:  team2
t = -2.6835, df = 14, p-value = 0.008911
alternative hypothesis: true mean is less than 3
95 percent confidence interval:
     -Inf 2.766309
sample estimates:
mean of x
     2.32

The value of the standardized test statistic is $t_{obs} = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}} = -2.6835$.

Step 4: Statistical Conclusion — The ℘-value is $P(t_{14} \le -2.6835) = 0.0089$.

I. From the rejection region, reject H0 because $t_{obs} = -2.6835$ is less than −1.7613.

II. From the ℘-value, reject H0 because the ℘-value 0.0089 is less than 0.05.

Reject H0.

Step 5: English Conclusion — There is evidence that the team2 average mission repair time is less than 3 days.

The evidence suggests both teams can complete the mission in less than 3 days.

7. The research and development department of an appliance company suspects the energy consumption required of their 18-cubic-foot refrigerator can be reduced by a slight modification to the current motor. Sixty 18-cubic-foot refrigerators were randomly selected from the company's warehouse. The first 30 had their motors modified while the last 30 were left intact. The energy consumption (kilowatts) for a 24-hour period for each refrigerator was recorded and stored in the data frame REFRIGERATOR. Is there evidence that the design modification reduces the refrigerators' average energy consumption?

Solution: To solve this problem, start by verifying the reasonableness of the normality assumption.

> ggplot(data = REFRIGERATOR, aes(x = group, y = kilowatts, fill = group)) +
+   geom_boxplot() +
+   theme_bw()
> ggplot(data = REFRIGERATOR, aes(sample = kilowatts, color = group)) +
+   stat_qq() +
+   theme_bw()

[Figure: side-by-side boxplots of kilowatts by group (modified, original) and normal quantile-quantile plots of kilowatts colored by group (x-axis: theoretical, y-axis: sample).]

The side-by-side boxplots and normal quantile-quantile plots suggest it is reasonable to assume the energy consumption for both models follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — Since the problem asks whether the mean energy consumption for modified refrigerators is less than the mean energy consumption for original refrigerators, use a lower one-sided alternative hypothesis.

H0 : µmodified − µoriginal = 0 versus H1 : µmodified − µoriginal < 0

Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.

> Means
modified original
1.535800 1.760067

The value of this test statistic is 1.5358 − 1.7601 = −0.2243. The standardized test statistic under the assumption that H0 is true and its approximate distribution are

\[
\frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \sim t_\nu .
\]

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately $t_\nu$ and H1 is a lower one-sided hypothesis, the rejection region is $t_{obs} < t_{0.05;\,54.7888} = -1.6731$.

> TR

	Welch Two Sample t-test

data:  kilowatts by group
t = -2.5128, df = 54.789, p-value = 0.007475
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
        -Inf -0.07494116
sample estimates:
mean in group modified mean in group original
              1.535800               1.760067

> RR
[1] -1.673144

The degrees of freedom are

\[
\nu = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X - 1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 54.7888,
\]

and the value of the standardized test statistic is

\[
t_{obs} = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}} = -2.5128.
\]

Step 4: Statistical Conclusion — The ℘-value is $P(t_{54.7888} \le -2.5128) = 0.0075$.

I. From the rejection region, reject H0 because $t_{obs} = -2.5128$ is less than −1.6731.

II. From the ℘-value, reject H0 because the ℘-value = 0.0075 is less than 0.05.

Reject H0.

Step 5: English Conclusion — There is evidence to suggest the average energy consumption for modified refrigerators is less than the average energy consumption for unmodified (original) refrigerators.
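The Welch degrees-of-freedom formula is easy to mistype, so a small helper can be useful. In the sketch below the function `welch_df()` and the two sample variances passed to it are hypothetical (the solution does not report the group variances); only the critical value and ℘-value lines use numbers taken from the output above.

```r
# Welch-Satterthwaite degrees of freedom from sample variances and sizes
welch_df <- function(s2x, nx, s2y, ny) {
  vx <- s2x / nx
  vy <- s2y / ny
  (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
}

# Hypothetical variances, for illustration only
nu_example <- welch_df(s2x = 0.10, nx = 30, s2y = 0.14, ny = 30)

# Critical value and p-value using the df and t reported in the output
crit   <- qt(0.05, df = 54.7888)
pvalue <- pt(-2.5128, df = 54.789)

c(nu_example = nu_example, crit = crit, pvalue = pvalue)
```

With equal variances and equal sample sizes the formula reduces to 2(n − 1), which gives a quick sanity check on the helper.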

8. The Yonalasee tennis club has two systems to measure the speed of a tennis ball. The local tennis pro suspects one system (speed1) consistently records faster speeds. To test her suspicions, she sets up both systems and records the speeds of 12 serves (three serves from each side of the court). The values are stored in the data frame TENNIS in the variables speed1 and speed2. The recorded speeds are in kilometers per hour. Does the evidence support the tennis pro's suspicion? Use α = 0.10.

Solution: Note that each system records the same 12 serves. Consequently, the serve speeds recorded by each system are dependent. Start the analysis by verifying the normality assumption required to use a paired t-test.

> eda(Diff)

[Figure: exploratory data analysis of Diff (histogram, density, boxplot, and Q-Q plot).]

The results from applying the function eda() to the differences between speed1 and speed2 suggest it is not unreasonable to assume the serve speed differences between speed1 and speed2 follow a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test the average difference (speed1 - speed2) in recorded speeds, the hypotheses are H0 : µD = 0 versus H1 : µD > 0

Step 2: Test Statistic — The test statistic chosen is $\bar{D}$ because $E[\bar{D}] = \mu_D$.

> dbar
[1] -1.329167
> n
[1] 12

The value of this test statistic is $\bar{d} = -1.3292$. The standardized test statistic under the assumption that H0 is true and its distribution are $\frac{\bar{D} - \delta_0}{S_D/\sqrt{n_D}} \sim t_{12-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t11 and H1 is an upper one-sided hypothesis, the rejection region is tobs > t0.90; 11 = 1.3634.


> RR
[1] 1.36343
> TR

One Sample t-test

data: Diff
t = -0.2804, df = 11, p-value = 0.6078
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 -9.842155 Inf
sample estimates:
mean of x
-1.329167

> # or
> with(data = TENNIS,
+ t.test(speed1, speed2, paired = TRUE, alternative = "greater")
+ )

Paired t-test

data: speed1 and speed2
t = -0.2804, df = 11, p-value = 0.6078
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -9.842155 Inf
sample estimates:
mean of the differences
 -1.329167

The value of the standardized test statistic is $t_{obs} = \frac{\bar{d} - \delta_0}{s_D/\sqrt{n_D}} = -0.2804$.

Step 4: Statistical Conclusion — The ℘-value is P(t11 ≥ −0.2804) = 0.6078.
I. From the rejection region, fail to reject H0 because tobs = −0.2804 is less than 1.3634.
II. From the ℘-value, fail to reject H0 because the ℘-value 0.6078 is greater than 0.10.
Fail to reject H0.

Step 5: English Conclusion — There is not sufficient evidence to suggest the mean difference between speeds is greater than zero.
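The paired-test quantities above can be reproduced from the three reported summaries (mean difference, standardized statistic, and sample size). This is a rough sketch in base R under that assumption; it does not use the TENNIS data, and the object names se and lower95 are illustrative only.

```r
# Values reported in the solution for the differences speed1 - speed2.
dbar  <- -1.329167  # mean difference
t_obs <- -0.2804    # standardized test statistic
n     <- 12         # number of paired serves
alpha <- 0.10

# Standard error implied by the reported statistics: dbar / t_obs
# (valid here because delta0 = 0).
se <- dbar / t_obs

# Upper one-sided test: reject when t_obs exceeds the 1 - alpha quantile.
RR     <- qt(1 - alpha, df = n - 1)
pvalue <- pt(t_obs, df = n - 1, lower.tail = FALSE)
reject <- t_obs > RR

# One-sided 95% lower bound, as printed by t.test() with its default
# conf.level = 0.95.
lower95 <- dbar - qt(0.95, df = n - 1) * se
```

Because tobs is rounded to four decimals in the output, the implied standard error and the bound agree with the printed interval only to about three decimals.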


9. An advertising agency is interested in targeting the appropriate gender for a new “low-fat” yogurt. In a national survey of 1200 women, 825 picked the “low-fat” yogurt over a regular yogurt. Meanwhile, 525 out of 1150 men picked the “low-fat” yogurt over the regular yogurt. Given these results, should the advertisements be targeted at a specific gender? Test the appropriate hypothesis at the α = 0.01 level.

Solution: To solve this problem, use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the proportion of women who favor low-fat yogurt is the same as the proportion of men who favor low-fat yogurt are H0 : πX = πY versus H1 : πX ≠ πY. In this case, let the random variable X represent the number of females favoring low-fat yogurt, and let the random variable Y represent the number of males favoring low-fat yogurt.

Step 2: Test Statistic — The test statistic chosen is PX − PY since E[PX − PY] = πX − πY. The standardized test statistic under the assumption that H0 is true is

$$Z = \frac{P_X - P_Y}{\sqrt{P(1 - P)\left(\frac{1}{m} + \frac{1}{n}\right)}},$$

where P is the pooled sample proportion, (X + Y)/(m + n), and m and n are the two sample sizes.

Step 3: Rejection Region Calculations — Because the standardized test statistic has an approximate N (0, 1) distribution and H1 is a two-sided hypothesis, the rejection region is |zobs | > z0.995 = 2.5758.

> RR
[1] 2.575829

x m y n p p
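The two-proportion test set up in Steps 1 to 3 can be sketched in base R from the counts given in the problem statement. The variable names x, m, y, n, and p echo the surviving fragment above, but their definitions here are assumptions, since the original assignments are not legible; treat this as an illustration of the pooled z statistic rather than the manual's code.

```r
# Counts from the problem statement.
x <- 825;  m <- 1200  # women choosing low-fat yogurt, women surveyed
y <- 525;  n <- 1150  # men choosing low-fat yogurt, men surveyed

p_x <- x / m
p_y <- y / n
p   <- (x + y) / (m + n)  # pooled proportion under H0

# Standardized test statistic and two-sided decision at alpha = 0.01.
z_obs  <- (p_x - p_y) / sqrt(p * (1 - p) * (1 / m + 1 / n))
RR     <- qnorm(1 - 0.01 / 2)
pvalue <- 2 * pnorm(abs(z_obs), lower.tail = FALSE)
reject <- abs(z_obs) > RR

# Cross-check: prop.test() without continuity correction reports z_obs^2
# as its X-squared statistic.
chisq <- unname(prop.test(c(x, y), c(m, n), correct = FALSE)$statistic)
```

Here the sample proportions are 0.6875 for women and about 0.4565 for men, so zobs falls far beyond the critical value 2.5758.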

> ggplot(data = MILKCARTON, aes(x = size, y = seconds, fill = size)) +
+ geom_boxplot() +
+ theme_bw()
> ggplot(data = MILKCARTON, aes(sample = seconds, color = size)) +
+ stat_qq() +
+ theme_bw()

[Figure: side-by-side boxplots of seconds by size (halfgallon, wholegallon) and normal quantile-quantile plots of seconds (sample versus theoretical quantiles) by size.]

The side-by-side boxplots and normal quantile-quantile plots suggest it is reasonable to assume the drying times for both half gallon and whole gallon containers follow normal distributions; however, it is clear from the boxplot that the variances are very different. Now, proceed with the five-step procedure.

Step 1: Hypotheses — Since the problem asks whether the mean drying times for half and whole gallon containers differ, use a two-sided alternative hypothesis.

H0 : µX − µY = 0 versus H1 : µX − µY ≠ 0

Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.

> Means
 halfgallon wholegallon
    9.98500    12.19525

The value of this test statistic is 9.985 − 12.1952 = −2.2103. The standardized test statistic under the assumption that H0 is true and its approximate distribution are

$$\frac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}} \sim t_\nu.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately tν and H1 is a two-sided hypothesis, the rejection region is |tobs | > t0.95; 46.0992 = 1.6786.

> TR

Welch Two Sample t-test

data: seconds by size
t = -6.5172, df = 46.099, p-value = 4.796e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.892864 -1.527636
sample estimates:
mean in group halfgallon mean in group wholegallon
 9.98500 12.19525

> RR
[1] 1.678586

The degrees of freedom are

$$\nu = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{(s_X^2/n_X)^2}{n_X - 1} + \frac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 46.0992,$$

and the value of the standardized test statistic is

$$t_{obs} = \frac{(\bar{x} - \bar{y}) - \delta_0}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} = -6.5172.$$

Step 4: Statistical Conclusion — The ℘-value is 2 × P(t46.0992 ≥ 6.5172) ≈ 0 (R reports 4.796e-08).
I. From the rejection region, reject H0 because |tobs | = 6.5172 is greater than 1.6786.
II. From the ℘-value, reject H0 because the ℘-value ≈ 0 is less than 0.05.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest the average drying times for half and whole gallon containers are not the same.
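The ℘-value in Step 4 is extremely small rather than exactly zero. As a brief sketch (base R only, with tobs and ν copied from the Welch output rather than recomputed from MILKCARTON), the two-sided tail area can be evaluated directly:

```r
# Values reported in the Welch output above.
t_obs <- -6.5172
nu    <- 46.0992

# Two-sided p-value: twice the tail area beyond |t_obs|.
pvalue <- 2 * pt(abs(t_obs), df = nu, lower.tail = FALSE)

# format.pval() is one way to display very small p-values without
# rounding them to zero.
format.pval(pvalue, digits = 4)
```

Reporting the value in scientific notation, as t.test() does, avoids the impression that the tail probability is literally 0.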

11. A multinational conglomerate has two textile centers in two different cities. In order to make a profit, each location must produce more than 1000 kilograms of refined wool per day. A random sample of the wool production in kilograms on five different days over the last year for the two locations was taken. The results are stored in the data frame WOOL. Based on the collected data, does the evidence suggest the locations are profitable? Is one location superior to the other? Solution: To see if textileA is profitable, start by verifying the normality assumption of the data using exploratory data analysis (eda()). > woolA with(data = woolA, + eda(production) + )

[Figure: output of eda(production) for textileA, with panels Histogram of production, Density of production, Boxplot of production, and Q−Q Plot of production.]

The results from applying the function eda() to the production of wool suggest it is not unreasonable to assume production of wool for textileA follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if wool production for textileA exceeds 1000 kilograms per day, the hypotheses (with production expressed in thousands of kilograms) are H0 : µ = 1 versus H1 : µ > 1

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 1.226

The value of this test statistic is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 1.226$. The standardized test statistic under the assumption that H0 is true and its distribution are $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t14 and H1 is an upper one-sided hypothesis, the rejection region is tobs > t0.95; 14 = 1.7613.

> RR
[1] 1.76131

> TR

One Sample t-test

data: production
t = 5.1942, df = 14, p-value = 6.801e-05
alternative hypothesis: true mean is greater than 1
95 percent confidence interval:
 1.149365 Inf
sample estimates:
mean of x
 1.226

The value of the standardized test statistic is $t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 5.1942$.

Step 4: Statistical Conclusion — The ℘-value is P(t14 ≥ 5.1942) = 6.801e-05 ≈ 0.0001.

I. From the rejection region, reject H0 because tobs = 5.1942 is greater than 1.7613. II. From the ℘-value, reject H0 because the ℘-value ≈ 0.0001 is less than 0.05.

Reject H0 .

Step 5: English Conclusion — There is evidence to suggest textileA produces more than 1000 kilograms of refined wool per day.
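The ℘-value and the one-sided confidence bound in the textileA output are linked through the standard error. The sketch below, which assumes only the summaries printed above (it does not read the WOOL data frame, and the object names are illustrative), recovers both.

```r
xbar  <- 1.226   # sample mean reported for textileA
mu0   <- 1       # null value
t_obs <- 5.1942  # standardized test statistic
df    <- 14      # n - 1 with n = 15

# Implied standard error s / sqrt(n), from t_obs = (xbar - mu0) / se.
se <- (xbar - mu0) / t_obs

# Upper-tail p-value and the one-sided 95% lower confidence bound.
pvalue <- pt(t_obs, df = df, lower.tail = FALSE)
lower  <- xbar - qt(0.95, df = df) * se
```

Since the lower bound exceeds 1, the confidence bound leads to the same conclusion as the rejection region and the ℘-value.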

To see if textileB is profitable, start by verifying the normality assumption of the data using exploratory data analysis (eda()). > woolB with(data = woolB, + eda(production) + )

[Figure: output of eda(production) for textileB, with panels Histogram of production, Density of production, Boxplot of production, and Q−Q Plot of production.]

The results from applying the function eda() to the production of wool suggest it is not unreasonable to assume production of wool for textileB follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if wool production for textileB exceeds 1000 kilograms per day, the hypotheses (with production expressed in thousands of kilograms) are H0 : µ = 1 versus H1 : µ > 1

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 1.446

The value of this test statistic is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 1.446$. The standardized test statistic under the assumption that H0 is true and its distribution are $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t14 and H1 is an upper one-sided hypothesis, the rejection region is tobs > t0.95; 14 = 1.7613.

> RR
[1] 1.76131

> TR

One Sample t-test

data: production
t = 5.1386, df = 14, p-value = 7.53e-05
alternative hypothesis: true mean is greater than 1
95 percent confidence interval:
 1.293129 Inf
sample estimates:
mean of x
 1.446

The value of the standardized test statistic is $t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 5.1386$.

Step 4: Statistical Conclusion — The ℘-value is P(t14 ≥ 5.1386) = 7.53e-05 ≈ 0.0001.
I. From the rejection region, reject H0 because tobs = 5.1386 is greater than 1.7613.
II. From the ℘-value, reject H0 because the ℘-value ≈ 0.0001 is less than 0.05.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest textileB produces more than 1000 kilograms of refined wool per day.

To discover if one textile is superior to the other, start by verifying the reasonableness of the normality assumption.

> ggplot(data = WOOL, aes(x = location, y = production, fill = location)) +
+ geom_boxplot() +
+ theme_bw()
> ggplot(data = WOOL, aes(sample = production, color = location)) +
+ stat_qq() +
+ theme_bw()

[Figure: side-by-side boxplots of production by location (textileA, textileB) and normal quantile-quantile plots of production (sample versus theoretical quantiles) by location.]

The side-by-side boxplots and normal quantile-quantile plots suggest it may be reasonable to assume the wool production for both textile plants follows normal distributions; however, it is clear from the boxplot that the variances are different. Now, proceed with the five-step procedure.

Step 1: Hypotheses — Since the problem asks whether the mean wool production for the textile plants is different and does not suggest one textile plant is superior to the other, use a two-sided alternative hypothesis.

H0 : µX − µY = 0 versus H1 : µX − µY ≠ 0

Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.

> Means
textileA textileB
   1.226    1.446

The value of this test statistic is 1.226 − 1.446 = −0.22. The standardized test statistic under the assumption that H0 is true and its approximate distribution are

$$\frac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}} \sim t_\nu.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately tν and H1 is a two-sided hypothesis, the rejection region is |tobs | > t0.95; 20.6186 = 1.7222.

> TR

Welch Two Sample t-test

data: production by location
t = -2.266, df = 20.619, p-value = 0.03435
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.42213539 -0.01786461
sample estimates:
mean in group textileA mean in group textileB
 1.226 1.446

> RR
[1] 1.722211

The degrees of freedom are

$$\nu = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{(s_X^2/n_X)^2}{n_X - 1} + \frac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 20.6186,$$

and the value of the standardized test statistic is

$$t_{obs} = \frac{(\bar{x} - \bar{y}) - \delta_0}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} = -2.266.$$

Step 4: Statistical Conclusion — The ℘-value is 2 × P(t20.6186 ≥ 2.266) = 0.0344.
I. From the rejection region, reject H0 because |tobs | = 2.266 is greater than 1.7222.
II. From the ℘-value, reject H0 because the ℘-value = 0.0344 is less than 0.05.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest different mean wool production for the two plants.
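The critical value used in Step 3 is the 0.95 quantile of tν. For a two-sided region of the form |tobs| > c, that quantile corresponds to a total significance level of 0.10, whereas the 0.975 quantile is the one behind the reported 95 percent confidence interval. The following sketch, which uses only values copied from the output above (the object names are illustrative), makes both quantiles explicit; the decision to reject H0 is the same under either one.

```r
t_obs <- -2.266
nu    <- 20.6186
diff  <- 1.226 - 1.446      # difference in sample means

q95  <- qt(0.95,  df = nu)  # value reported as RR in the solution
q975 <- qt(0.975, df = nu)  # quantile behind the 95 percent interval

# Implied standard error of the difference (delta0 = 0), and the
# two-sided 95% confidence interval built from it.
se <- diff / t_obs
ci <- diff + c(-1, 1) * q975 * se

pvalue <- 2 * pt(abs(t_obs), df = nu, lower.tail = FALSE)

# Decision under each critical value.
c(abs(t_obs) > q95, abs(t_obs) > q975)
```

Because tobs is printed to three decimals, the reconstructed interval matches the t.test() interval only to about three decimals.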

12. Use the data frame FERTILIZE, which contains the height in inches for plants in the variable height and the fertilization type in the variable fertilization, to

(a) Test if the data suggest that the average height of self-fertilized plants is more than 17 inches. (Use α = 0.05.)
(b) Compute a one-sided 95% confidence interval for the average height of self-fertilized plants (H1 : µ > 17).
(c) Compute the required sample size to obtain a power of 0.90 if µ1 = 18 inches assuming that σ = s.
(d) What is the power of the test in part (a) if σ = s and µ1 = 18?

Solution:

(a) To solve this problem, start by verifying the normality assumption of the data using exploratory data analysis (eda()).

Step 1: Hypotheses — To test whether the average height of self-fertilized plants is more than 17 inches, the hypotheses are H0 : µ = 17 versus H1 : µ > 17

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 17.575

The test statistic is $\bar{x} = 17.575$. The standardized test statistic under the assumption that H0 is true and its distribution are $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t14, and H1 is an upper one-sided hypothesis, the rejection region is tobs > t1−0.05; 14 = t0.95; 14 = 1.7613.

> RR
[1] 1.76131
> TR

One Sample t-test

data: SELF$height
t = 1.0854, df = 14, p-value = 0.148
alternative hypothesis: true mean is greater than 17
95 percent confidence interval:
 16.64196 Inf
sample estimates:
mean of x
 17.575

The value of the standardized test statistic is $t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 1.0854$.

Step 4: Statistical Conclusion — The ℘-value is P(t14 ≥ 1.0854) = 0.148.
I. From the rejection region, fail to reject H0 because tobs = 1.0854 is less than 1.7613.
II. From the ℘-value, fail to reject H0 because the ℘-value = 0.148 is greater than 0.05.
Fail to reject H0.

Step 5: English Conclusion — There is insufficient evidence to suggest that the average height of self-fertilized plants is more than 17 inches.

(b)
> TR

One Sample t-test

data: SELF$height
t = 1.0854, df = 14, p-value = 0.148
alternative hypothesis: true mean is greater than 17
95 percent confidence interval:
 16.64196 Inf
sample estimates:
mean of x
 17.575

The one-sided 95% confidence interval for the average height of self-fertilized plants is [16.642, ∞).

(c)
> n
[1] 38

One needs a sample size of at least 38 to obtain a power of at least 0.90.

(d)
> POWER

     One-sample t test power calculation

              n = 15
          delta = 1
             sd = 2.051676
      sig.level = 0.05
          power = 0.5598609
    alternative = one.sided

> power
[1] 0.5598609

The power of the test is 0.5599.
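The calls that produced parts (c) and (d) are not legible in the text above, so the following is a reconstruction under stated assumptions: it uses power.t.test() from the stats package with the arguments shown in the printed power calculation (n = 15, delta = 1, sd = 2.051676, sig.level = 0.05, one-sided, one-sample). The object name s is illustrative.

```r
s <- 2.051676  # sample standard deviation reported in the output above

# (d) Power of the one-sided test with n = 15 when the true mean is 18.
POWER <- power.t.test(n = 15, delta = 18 - 17, sd = s, sig.level = 0.05,
                      type = "one.sample", alternative = "one.sided")
power <- POWER$power

# (c) Smallest n giving power of at least 0.90 under the same assumptions;
# power.t.test() solves for a fractional n, so round up.
n <- ceiling(power.t.test(power = 0.90, delta = 18 - 17, sd = s,
                          sig.level = 0.05, type = "one.sample",
                          alternative = "one.sided")$n)
```

Leaving exactly one of n and power unspecified tells power.t.test() which quantity to solve for.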

13. A manufacturer of lithium batteries has two production facilities. One facility (A) manufactures a battery with an advertised life of 180 hours, while the second facility (B) manufactures a battery with an advertised life of 200 hours. Both facilities are trying to reduce the variance in their products’ lifetimes. Is the variability in battery life equivalent, or does the evidence suggest the facility producing 200-hour batteries has smaller variability than the facility producing 180-hour batteries? Use the data frame BATTERY with α = 0.05 to test the appropriate hypothesis.

Solution: Prior to using a test that is very sensitive to departures from normality, density plots and quantile-quantile normal plots are created for both facilities.

> ggplot(data = BATTERY, aes(lifetime, fill = facility)) +
+ geom_density() +
+ theme_bw()
> ggplot(data = BATTERY, aes(sample = lifetime, color = facility)) +
+ stat_qq() +
+ theme_bw()

[Figure: density plots of lifetime by facility (A, B) and normal quantile-quantile plots of lifetime (sample versus theoretical quantiles) by facility.]

Based on the density plots and quantile-quantile normal plots, it seems reasonable to assume the battery life from both facilities follows normal distributions. Therefore, proceed with the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the variability in facility A’s battery life (X) is greater than the variability in facility B’s battery life (Y) are $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 > \sigma_Y^2$.

Step 2: Test Statistic — The test statistics chosen are $S_X^2$ and $S_Y^2$ since $E[S_X^2] = \sigma_X^2$ and $E[S_Y^2] = \sigma_Y^2$.

> VAR
       A        B
7.539291 4.347130

The values of these test statistics are $s_X^2 = 7.5393$ and $s_Y^2 = 4.3471$. The standardized test statistic under the assumption that H0 is true and its distribution are $S_X^2/S_Y^2 \sim F_{50-1,\,50-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed F49,49, and H1 is an upper one-sided hypothesis, the rejection region is fobs > F0.95; 49,49 = 1.6073.

> RR
[1] 1.607289
> TR

F test to compare two variances

data: lifetime by facility
F = 1.7343, num df = 49, denom df = 49, p-value = 0.02836
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
 1.079031 Inf
sample estimates:
ratio of variances
 1.734314

The value of the standardized test statistic is fobs = (7.5393)/(4.3471) = 1.7343.

Step 4: Statistical Conclusion — The ℘-value is P(F49,49 ≥ 1.7343) = 0.0284.
I. From the rejection region, reject H0 because fobs = 1.7343 is greater than 1.6073.
II. From the ℘-value, reject H0 because the ℘-value = 0.0284 is less than 0.05.
Reject H0.

Step 5: English Conclusion — The evidence suggests the variance of battery life from facility A is greater than the variance of battery life from facility B.
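The F test above depends only on the two sample variances and their degrees of freedom, so its pieces can be checked without the BATTERY data. The following is a small sketch in base R using the values printed by VAR; the object names are illustrative, not the manual's.

```r
s2_A <- 7.539291  # sample variance, facility A
s2_B <- 4.347130  # sample variance, facility B
df_A <- 49        # n_A - 1
df_B <- 49        # n_B - 1

f_obs  <- s2_A / s2_B
RR     <- qf(0.95, df_A, df_B)                       # upper critical value
pvalue <- pf(f_obs, df_A, df_B, lower.tail = FALSE)  # P(F >= f_obs)

# One-sided 95% lower confidence bound for the ratio of variances.
lower <- f_obs / RR
```

The lower bound is the observed ratio divided by the same quantile that defines the rejection region, which is why the bound exceeds 1 exactly when fobs exceeds the critical value.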

14. In the construction of a safety strobe, a particular manufacturer can purchase LED diodes from one of two suppliers. It is critical that the purchased diodes conform to their stated specifications with respect to diameter since they must be mated with a fixed width cable. The diameter in millimeters for a random sample of 15 diodes from each of the two suppliers is stored in the data frame LEDDIODE. Based on the data, is there evidence to suggest a difference in variabilities between the two suppliers? Use an α level of 0.01.

Solution: Prior to using a test that is very sensitive to departures from normality, density plots and quantile-quantile normal plots are created for both suppliers.

> ggplot(data = LEDDIODE, aes(diameter, fill = supplier)) +
+ geom_density(alpha = 0.3) +
+ theme_bw()
> ggplot(data = LEDDIODE, aes(sample = diameter, color = supplier)) +
+ stat_qq() +
+ theme_bw()

[Figure: density plots of diameter by supplier (supplierA, supplierB) and normal quantile-quantile plots of diameter (sample versus theoretical quantiles) by supplier.]

Based on the density plots and quantile-quantile normal plots, it seems reasonable to assume the LED diode widths from both suppliers follow normal distributions. Therefore, proceed with the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the variability in LED diode widths using supplier A’s (X) diodes is not equal to the variability in LED diode widths using supplier B’s (Y) diodes are $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 \neq \sigma_Y^2$.

Step 2: Test Statistic — The test statistics chosen are $S_X^2$ and $S_Y^2$ since $E[S_X^2] = \sigma_X^2$ and $E[S_Y^2] = \sigma_Y^2$.

> VAR
 supplierA  supplierB
0.06495524 0.01506381

The values of these test statistics are $s_X^2 = 0.065$ and $s_Y^2 = 0.0151$. The standardized test statistic under the assumption that H0 is true and its distribution are $S_X^2/S_Y^2 \sim F_{15-1,\,15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed F14,14, and H1 is a two-sided hypothesis, the rejection region is fobs < F0.005; 14,14 = 0.2326 or fobs > F0.995; 14,14 = 4.2993.

> c(RRlower, RRupper)
[1] 0.2325967 4.2992869
> TR

F test to compare two variances

data: diameter by supplier
F = 4.312, num df = 14, denom df = 14, p-value = 0.009861
alternative hypothesis: true ratio of variances is not equal to 1
99 percent confidence interval:
 1.002958 18.538551
sample estimates:
ratio of variances
 4.312006

The value of the standardized test statistic is fobs = (0.065)/(0.0151) = 4.312.

Step 4: Statistical Conclusion — The ℘-value is P(F14,14 ≥ 4.312) × 2 = 0.0099.
I. From the rejection region, reject H0 because fobs = 4.312 is greater than 4.2993.
II. From the ℘-value, reject H0 because the ℘-value = 0.0099 is less than 0.01.
Reject H0.

Step 5: English Conclusion — The evidence suggests the variance of the width of LED diodes from supplier A is not equal to the variance of the width of LED diodes from supplier B.
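For the two-sided F test, the rejection region, the ℘-value, and the 99 percent confidence interval all come from the same two F quantiles. The sketch below assumes only the sample variances printed by VAR (it does not read LEDDIODE), and its object names are illustrative.

```r
s2_A  <- 0.06495524  # sample variance, supplier A
s2_B  <- 0.01506381  # sample variance, supplier B
df    <- 14          # 15 - 1 for each supplier
alpha <- 0.01

f_obs <- s2_A / s2_B

# Lower and upper critical values of the two-sided rejection region.
RR <- qf(c(alpha / 2, 1 - alpha / 2), df, df)

# Two-sided p-value: twice the smaller tail area.
pvalue <- 2 * min(pf(f_obs, df, df),
                  pf(f_obs, df, df, lower.tail = FALSE))

# 99% confidence interval for the ratio of variances.
ci <- f_obs / rev(RR)
```

With equal numerator and denominator degrees of freedom, the two critical values are reciprocals of each other, which explains the pair 0.2326 and 4.2993.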

15. The technology at a certain computer manufacturing plant allows silicon sheets to be split into chips using two different techniques. In an effort to decide which technique is superior, 28 silicon sheets are randomly selected from the warehouse. The two techniques of splitting the chips are randomly assigned to the 28 sheets so that each technique is applied to 14 sheets. The results from the experiment are stored in the data frame CHIPS. Use α = 0.05, and test the appropriate hypothesis to see if there are differences between the two techniques. The values recorded in CHIPS are the number of usable chips from each silicon sheet.

Solution: To solve this problem, start by verifying the reasonableness of the normality assumption.

> ggplot(data = CHIPS, aes(number, fill = method)) +
+ geom_density(alpha = 0.3) +
+ theme_bw()
> ggplot(data = CHIPS, aes(sample = number, color = method)) +
+ stat_qq() +
+ theme_bw()

[Figure: density plots of number by method (techniqueI, techniqueII) and normal quantile-quantile plots of number (sample versus theoretical quantiles) by method.]

The density plots and normal quantile-quantile plots suggest it is reasonable to assume the number of usable chips from both techniques follows normal distributions; however, it is clear from the density plots that the variances are different. Now, proceed with the five-step procedure.

Step 1: Hypotheses — Since the problem asks whether there are differences in the mean number of usable chips generated by the two techniques, use a two-sided alternative hypothesis.

H0 : µX − µY = 0 versus H1 : µX − µY ≠ 0

Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.

> MEANS
 techniqueI techniqueII
   337.6429    360.0714

The value of this test statistic is 337.6429 − 360.0714 = −22.4286. The standardized test statistic under the assumption that H0 is true and its approximate distribution are

$$\frac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}} \sim t_\nu.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately tν, and H1 is a two-sided hypothesis, the rejection region is tobs < t0.025; 18.4541 = −2.0972 or tobs > t0.975; 18.4541 = 2.0972.

> TR

Welch Two Sample t-test

data: number by method
t = -1.1175, df = 18.454, p-value = 0.2781
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -64.52203 19.66489
sample estimates:
mean in group techniqueI mean in group techniqueII
 337.6429 360.0714

> c(RRlower, RRupper)
[1] -2.097223 2.097223

The degrees of freedom are

$$\nu = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{(s_X^2/n_X)^2}{n_X - 1} + \frac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 18.4541,$$

and the value of the standardized test statistic is

$$t_{obs} = \frac{(\bar{x} - \bar{y}) - \delta_0}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} = -1.1175.$$

Step 4: Statistical Conclusion — The ℘-value is 2 × P(t18.4541 ≥ |−1.1175|) = 0.2781.
I. From the rejection region, fail to reject H0 because tobs = −1.1175 lies between −2.0972 and 2.0972.
II. From the ℘-value, fail to reject H0 because the ℘-value = 0.2781 is greater than 0.05.
Fail to reject H0.

Step 5: English Conclusion — There is insufficient evidence to suggest the average number of usable chips from technique I is different from the average number of usable chips from technique II.

16. Phenylketonuria (PKU) is a genetic disorder that is characterized by an inability of the body to utilize an essential amino acid, phenylalanine. Research suggests patients with phenylketonuria have deficiencies in coenzyme Q10. The data frame PHENYL records the level of Q10 at four different times for 46 patients diagnosed with PKU. The variable Q10.1 contains the level of Q10 measured in µM for the 46 patients. Q10.2, Q10.3, and Q10.4 record the values recorded at later times, respectively, for the 46 patients (Artuch et al., 2004). (a) Normal patients have a Q10 reading of 0.69 µM. Using the variable Q10.2, is there evidence that the mean value of Q10 in patients diagnosed with PKU is less than 0.69 µM? (Use α = 0.01.)


(b) Patients diagnosed with PKU are placed on strict vegetarian diets. Some have speculated that patients diagnosed with PKU have low Q10 readings because meats are rich in Q10. Is there evidence that the patients’ Q10 level decreases over time? Construct a 99% confidence interval for the mean difference of the Q10 levels using Q10.1 and Q10.4. Solution: (a) To solve this problem, start by verifying the normality assumption of the data using exploratory data analysis (eda()). > Q10.2 eda(Q10.2)

[Figure: output of eda(Q10.2), with panels Histogram of Q10.2, Density of Q10.2, Boxplot of Q10.2, and Q−Q Plot of Q10.2.]

The results from applying the function eda() to variable Q10.2 suggest it is not unreasonable to assume that Q10.2 follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if the mean value of Q10.2 is less than 0.69 µM, the hypotheses are H0 : µ = 0.69 versus H1 : µ < 0.69

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 0.5165217

The value of this test statistic is $\bar{x} = 0.5165$. The standardized test statistic under the assumption that H0 is true and its distribution are $\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{46-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t45, and H1 is a lower one-sided hypothesis, the rejection region is tobs < t0.01; 45 = −2.4121.

> RR
[1] -2.412116
> TR

One Sample t-test

data: Q10.2
t = -6.2405, df = 45, p-value = 6.856e-08
alternative hypothesis: true mean is less than 0.69
95 percent confidence interval:
 -Inf 0.5632078
sample estimates:
mean of x
0.5165217

The value of the standardized test statistic is $t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = -6.2405$.

Step 4: Statistical Conclusion — The ℘-value is P(t45 ≤ −6.2405) ≈ 0 (R reports 6.856e-08).
I. From the rejection region, reject H0 because tobs = −6.2405 is less than −2.4121.
II. From the ℘-value, reject H0 because the ℘-value ≈ 0 is less than 0.01.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest that the mean Q10.2 level is less than 0.69 µM.

(b) Note that the problem is solved by comparing the Q10.1 and Q10.4 values for each subject. Consequently, this question is answered using a paired t-test. Start the analysis by verifying the normality assumption required to use a paired t-test. > Diff eda(Diff)

[Figure: output of eda(Diff), with panels Histogram of Diff, Density of Diff, Boxplot of Diff, and Q−Q Plot of Diff.]

The results from applying the function eda() to the differences between Q10.1 and Q10.4 suggest it is not unreasonable to assume the Q10 differences between Q10.1 and Q10.4 follow a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if the average difference between Q10.1 and Q10.4 is greater than zero, the hypotheses are H0 : µD = 0 versus H1 : µD > 0

Step 2: Test Statistic — The test statistic chosen is $\bar{D}$ because $E[\bar{D}] = \mu_D$.

> dbar
[1] 0.1215217

The value of this test statistic is $\bar{d} = 0.1215$. The standardized test statistic under the assumption that H0 is true and its distribution are $\frac{\bar{D} - \delta_0}{S_D/\sqrt{n_D}} \sim t_{46-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t45 and H1 is an upper one-sided hypothesis, the rejection region is tobs > t0.99; 45 = 2.4121.

> RR
[1] 2.412116
> TR

One Sample t-test

data: Diff
t = 4.3372, df = 45, p-value = 4.02e-05
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 0.07446716 Inf
sample estimates:
mean of x
0.1215217

> # Or
> t.test(PHENYL$Q10.1, PHENYL$Q10.4, paired = TRUE,
+ alternative = "greater")

Paired t-test

data: PHENYL$Q10.1 and PHENYL$Q10.4
t = 4.3372, df = 45, p-value = 4.02e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.07446716 Inf
sample estimates:
mean of the differences
 0.1215217

The value of the standardized test statistic is $t_{obs} = \frac{\bar{d} - \delta_0}{s_D/\sqrt{n_D}} = 4.3372$.

Step 4: Statistical Conclusion — The ℘-value is P(t45 ≥ 4.3372) ≈ 0 (R reports 4.02e-05).
I. From the rejection region, reject H0 because tobs = 4.3372 is greater than 2.4121.
II. From the ℘-value, reject H0 because the ℘-value ≈ 0 is less than 0.01.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest Q10 levels decrease over time.

> CI
[1] 0.04616434 0.19687914
attr(,"conf.level")
[1] 0.99

The 99% confidence interval for the mean difference of the Q10 levels is [0.0462, 0.1969].
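The two-sided 99% interval uses the 0.995 quantile of t45, unlike the one-sided bounds printed by the tests above. As a hedged illustration, the interval can be rebuilt from the reported mean difference and standardized statistic; the sketch below does not read the PHENYL data, and the names se and ci99 are illustrative.

```r
dbar  <- 0.1215217  # mean of the differences Q10.1 - Q10.4
t_obs <- 4.3372     # standardized test statistic
n     <- 46         # number of patients

# Implied standard error s_D / sqrt(n_D), valid because delta0 = 0.
se <- dbar / t_obs

# Two-sided 99% confidence interval for the mean difference.
ci99 <- dbar + c(-1, 1) * qt(1 - 0.01 / 2, df = n - 1) * se
```

With the data frame available, a call such as t.test(PHENYL$Q10.1, PHENYL$Q10.4, paired = TRUE, conf.level = 0.99)$conf.int is one way to obtain an interval in the form printed above.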


17. According to the Pamplona, Spain, registration, 0.4% of immigrants in 2002 were from Bolivia. In June of 2005, a sample of 3740 registered foreigners was randomly selected. Of these, 87 were Bolivians. Is there evidence to suggest immigration from Bolivia has increased? (Use α = 0.05.)

Solution: Use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether immigration from Bolivia has increased are H0 : π = 0.004 versus H1 : π > 0.004.

Step 2: Test Statistic — The test statistic chosen is Y, where Y is the number of Bolivian immigrants. Provided H0 is true, Y ∼ Bin(n, π0). The value of the test statistic is yobs = 87.

Step 3: Rejection Region Calculations — Rejection is based on the ℘-value, so none are required.

Step 4: Statistical Conclusion — Likelihood Method:

$$\wp\text{-value} = P(Y \ge y_{obs} \mid H_0) = \sum_{i=y_{obs}}^{n} \binom{n}{i} \pi_0^{i} (1 - \pi_0)^{n-i} = \sum_{i=87}^{3740} \binom{3740}{i} 0.004^{i} (0.996)^{3740-i} \approx 0 \quad \text{(computed with R)}$$

> pvalue
[1] 1.505911e-37
> TR

Exact binomial test

data: 87 and 3740
number of successes = 87, number of trials = 3740, p-value < 2.2e-16
alternative hypothesis: true probability of success is greater than 0.004
95 percent confidence interval:
 0.01935316 1.00000000
sample estimates:
probability of success
 0.02326203

Chapter 9: Hypothesis Testing

Reject H0 . Step 5: English Conclusion — There is evidence to suggest the proportion of Bolivian immigrants in Pamplona, Spain, has increased.
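The exact upper-tail probability used in Step 4 can be reproduced with base R alone. The following is a minimal sketch, not necessarily the code used in the manual; the variable names are chosen here for illustration.

```r
# Exact upper-tail binomial p-value for H0: pi = 0.004 versus H1: pi > 0.004
n    <- 3740    # sample size given in the exercise
yobs <- 87      # observed number of Bolivians
pi0  <- 0.004   # proportion under H0

# P(Y >= yobs | H0); with lower.tail = FALSE, pbinom() returns P(Y > yobs - 1)
pvalue <- pbinom(yobs - 1, size = n, prob = pi0, lower.tail = FALSE)
pvalue

# binom.test() reports the same tail probability
TR <- binom.test(x = yobs, n = n, p = pi0, alternative = "greater")
TR$p.value
```

Using `lower.tail = FALSE` instead of `1 - pbinom(...)` matters here: the subtraction would round a probability this small to 0.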

18. Find the power for the hypothesis H0 : µ = 65 versus H1 : µ > 65 if µ1 = 70 at the α = 0.01 level assuming σ = s for the variable hard in the data frame Rubber of the MASS package.

Solution:
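One way to obtain this power is sketched below. It assumes a one-sided, one-sample t-test whose power is computed from the noncentral t distribution with s substituted for σ; that choice is an assumption made here, although it appears consistent with the reported value of 0.4278.

```r
library(MASS)  # provides the Rubber data frame

n     <- length(Rubber$hard)   # number of observations
s     <- sd(Rubber$hard)       # sample standard deviation, used in place of sigma
delta <- 70 - 65               # mu1 - mu0
alpha <- 0.01

# Power of the one-sided, one-sample t-test
POWER <- power.t.test(n = n, delta = delta, sd = s, sig.level = alpha,
                      type = "one.sample", alternative = "one.sided")
POWER$power

# The same quantity written out with the noncentral t distribution
tcrit <- qt(1 - alpha, df = n - 1)
pt(tcrit, df = n - 1, ncp = delta / (s / sqrt(n)), lower.tail = FALSE)
```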

> library(MASS)
> power
[1] 0.4277995

The power for the hypothesis H0 : µ = 65 versus H1 : µ > 65 if µ1 = 70 at the α = 0.01 level assuming σ = 12.1767 for the variable hard in the data frame Rubber of the MASS package is 0.4278.

19. The director of urban housing in Vitoria, Spain, claims that at least 50% of all apartments have more than one bathroom and that at least 75% of all apartments have an elevator.
(a) Can the director's claim about bathrooms be contradicted? Test the appropriate hypothesis using α = 0.10. Note that the number of bathrooms is stored in the variable toilets in the data frame VIT2005.
(b) Can the director's claim about elevators be substantiated using an α level of 0.10? Use both an approximate method as well as an exact method to reach a conclusion. Are the methods in agreement?


(c) Test whether, among apartments built prior to 1980 without garages, the proportion with elevators is smaller than the proportion without elevators.

Solution:
(a) Use the five-step procedure.
Step 1: Hypotheses. The null and alternative hypotheses to contradict the housing director's claim about the bathrooms are H0 : π = 0.50 versus H1 : π < 0.50.
Step 2: Test Statistic. The test statistic chosen is Y, where Y is the number of apartments that have more than one bathroom. Provided H0 is true, Y ∼ Bin(n, π0).

> FT
toilets
  1   2
116 102
> c(yobs, n)
  2
102 218

The value of the test statistic is yobs = 102.
Step 3: Rejection Region Calculations. Rejection is based on the ℘-value, so none are required.
Step 4: Statistical Conclusion. Likelihood Method:

$$\wp\text{-value} = P(Y \le y_{\text{obs}} \mid H_0) = \sum_{i=0}^{y_{\text{obs}}} \binom{n}{i} \pi_0^{i} (1-\pi_0)^{n-i} = \sum_{i=0}^{102} \binom{218}{i} 0.5^{i} (0.5)^{218-i} = 0.1893$$

(computed with R).

> pvalue
[1] 0.1893231


> TR

        Exact binomial test

data:  yobs and n
number of successes = 102, number of trials = 218, p-value = 0.1893
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.000000 0.525842
sample estimates:
probability of success
             0.4678899

Thus, one fails to reject H0 because 0.1893 is greater than 0.1. Fail to reject H0.
Step 5: English Conclusion. There is not sufficient evidence to contradict the claim that at least 50% of all apartments have more than one bathroom.

(b) Use the five-step procedure.
Exact method:
Step 1: Hypotheses. The null and alternative hypotheses to support the housing director's claim about elevators are H0 : π = 0.75 versus H1 : π > 0.75.
Step 2: Test Statistic. The test statistic chosen is Y, where Y is the number of apartments that have an elevator.

> FT
elevator
  0   1
 44 174
> c(yobs, n)
  1
174 218

Provided H0 is true, Y ∼ Bin(n, π0). The value of the test statistic is yobs = 174.
Step 3: Rejection Region Calculations. Rejection is based on the ℘-value, so none are required.


Step 4: Statistical Conclusion. Likelihood Method:

$$\wp\text{-value} = P(Y \ge y_{\text{obs}} \mid H_0) = \sum_{i=y_{\text{obs}}}^{n} \binom{n}{i} \pi_0^{i} (1-\pi_0)^{n-i} = \sum_{i=174}^{218} \binom{218}{i} 0.75^{i} (0.25)^{218-i} = 0.0564$$

(computed with R).

> pvalue
[1] 0.05643458
> TR

        Exact binomial test

data:  yobs and n
number of successes = 174, number of trials = 218, p-value = 0.05643
alternative hypothesis: true probability of success is greater than 0.75
95 percent confidence interval:
 0.748218 1.000000
sample estimates:
probability of success
             0.7981651

Thus, one rejects H0 because 0.0564 is less than 0.1. Reject H0.
Step 5: English Conclusion. There is evidence to substantiate the claim of the housing director regarding elevators.

Approximate method: Use the five-step procedure.
Step 1: Hypotheses. The null and alternative hypotheses to substantiate the housing director's claim regarding elevators are H0 : π = 0.75 versus H1 : π > 0.75.
Step 2: Test Statistic. The test statistic chosen is P, where P is the proportion of apartments with elevators. Provided H0 is true,

$$P \sim N\left(\pi_0, \sqrt{\frac{\pi_0(1-\pi_0)}{n}}\right)$$

and the standardized test statistic is

$$Z = \frac{P - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} \sim N(0, 1).$$

Step 3: Rejection Region Calculations. Because the standardized test statistic has an approximate N(0, 1) distribution, and H1 is an upper one-sided hypothesis, the rejection region is zobs > z0.9 = 1.2816.

> RR
[1] 1.281552
> TRnoCC <- prop.test(yobs, n, p = 0.75, "greater", correct = FALSE)
> TRwiCC <- prop.test(yobs, n, p = 0.75, "greater", correct = TRUE)
> TRnoCC

        1-sample proportions test without continuity correction

data:  yobs out of n, null probability 0.75
X-squared = 2.6972, df = 1, p-value = 0.05026
alternative hypothesis: true p is greater than 0.75
95 percent confidence interval:
 0.7499209 1.0000000
sample estimates:
        p
0.7981651

> TRwiCC

        1-sample proportions test with continuity correction

data:  yobs out of n, null probability 0.75
X-squared = 2.4465, df = 1, p-value = 0.05889
alternative hypothesis: true p is greater than 0.75
95 percent confidence interval:
 0.7474708 1.0000000
sample estimates:
        p
0.7981651

The value of the standardized test statistic is

Without Continuity Correction

$$z_{\text{obs}} = \frac{p - \pi_0}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} = \frac{\frac{174}{218} - 0.75}{\sqrt{\frac{(0.75)(1-0.75)}{218}}} = 1.6423$$

With Continuity Correction

$$z_{\text{obs}} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\frac{\pi_0(1-\pi_0)}{n}}} = \frac{\frac{174}{218} - 0.75 - \frac{1}{436}}{\sqrt{\frac{(0.75)(1-0.75)}{218}}} = 1.5641$$

For this upper-tailed test the correction 1/(2n) is subtracted from p − π0, which is the sign that reproduces the reported value 1.5641.
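The two standardized statistics and their ℘-values can be checked numerically. This is a minimal sketch using only the counts quoted in the solution; the object names `z_noCC` and `z_wiCC` are introduced here for illustration.

```r
yobs <- 174    # apartments with an elevator
n    <- 218    # apartments sampled
pi0  <- 0.75   # proportion under H0

p  <- yobs / n
se <- sqrt(pi0 * (1 - pi0) / n)

z_noCC <- (p - pi0) / se                  # without continuity correction
z_wiCC <- (p - pi0 - 1 / (2 * n)) / se    # with continuity correction (upper tail)
c(z_noCC, z_wiCC)

# Upper-tail p-values
pnorm(c(z_noCC, z_wiCC), lower.tail = FALSE)

# prop.test() reports the squared z statistic as X-squared
prop.test(yobs, n, p = pi0, alternative = "greater", correct = FALSE)$statistic
z_noCC^2
```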

Step 4: Statistical Conclusion. The ℘-value is P(Z ≥ 1.6423) = 0.0503 or P(Z ≥ 1.5641) = 0.0589 for continuity corrections not used and used, respectively.
I. From the rejection region, reject H0 because zobs = 1.6423 (no continuity correction) is greater than 1.2816, and zobs = 1.5641 (continuity correction) is greater than 1.2816.
II. From the ℘-value, reject H0 because the ℘-value = 0.0503 (without continuity correction) or ℘-value = 0.0589 (with continuity correction) is less than 0.1.
Reject H0.
Step 5: English Conclusion. There is evidence to support the housing director's claim about the percent of apartments with elevators.

(c) To solve this problem, use Fisher's exact test and the five-step procedure. Only Fisher's Exact Test will be completed to answer the question as the n(1 − π) > 10 condition will not be satisfied for a large sample approximation.
Step 1: Hypotheses. The null and alternative hypotheses to test whether the proportion of apartments built prior to 1980 with elevators that do not have garages is less than the proportion of apartments that do not have elevators or garages are H0 : πX = πY versus H1 : πX > πY. In this case, the random variable X will represent the number of apartments built prior to 1980 that do not have garages or elevators, and the random variable Y will represent the number of apartments built prior to 1980 that have an elevator but no garage.
Step 2: Test Statistic. The test statistic chosen is X, where X is the number of apartments built prior to 1980 that do not have garages or elevators.

> FT
        garage
elevator  0  1
       0 19  0
       1 22  4

Table 9.1: Apartments built prior to 1980 classified by the presence of a garage and of an elevator

                    Garage
Elevators    NO        YES          Total
NO           19 = x    0            19 = m
YES          22        4            26 = n
Total        41 = k    4 = N − k    45 = N

The observed value of the test statistic is x = 19. Provided H0 is true, and conditioning on the fact that X + Y = k, X ∼ Hyper(m, n, k).
Step 3: Rejection Region Calculations. Rejection is based on the ℘-value, so none are required.
Step 4: Statistical Conclusion. To compute the ℘-value, compute

$$P(X \ge x \mid H_0) = \sum_{i=x}^{\min\{m,k\}} \frac{\binom{m}{i}\binom{n}{k-i}}{\binom{N}{k}} = \sum_{i=19}^{\min\{19,41\}} \frac{\binom{19}{i}\binom{26}{41-i}}{\binom{45}{41}} = 0.1003$$

> pvalue
[1] 0.1003389
> TR

        Fisher's Exact Test for Count Data

data:  FT
p-value = 0.1003
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.6878007       Inf
sample estimates:
odds ratio
       Inf

Since the ℘-value is 0.1003, one fails to reject H0 because 0.1003 is greater than 0.10. Fail to reject H0.
Step 5: English Conclusion. There is not sufficient evidence to suggest that the proportion of apartments built prior to 1980 with elevators and no garages is lower than the proportion of apartments built prior to 1980 without elevators and no garages.
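The hypergeometric sum in Step 4 has a single term, because x = 19 is already the largest value X can take. A short sketch that rebuilds the 2 × 2 table from Table 9.1 (rather than from VIT2005, so it runs without the PASWR2 package) makes this concrete:

```r
# Counts from Table 9.1: rows are elevator (0 = no, 1 = yes), columns are garage
FT <- matrix(c(19, 22, 0, 4), nrow = 2,
             dimnames = list(elevator = c("0", "1"), garage = c("0", "1")))

m <- 19   # apartments without an elevator
n <- 26   # apartments with an elevator
k <- 41   # apartments without a garage
x <- 19   # apartments with neither

# P(X >= x | H0) for X ~ Hyper(m, n, k)
pvalue <- sum(dhyper(x:min(m, k), m, n, k))
pvalue

# Only one term survives, so the sum equals choose(26, 22) / choose(45, 41)
choose(26, 22) / choose(45, 41)

# fisher.test() gives the same one-sided p-value
fisher.test(FT, alternative = "greater")$p.value
```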

20. A rule of thumb used by realtors in Vitoria, Spain, is that each square meter will cost roughly €3000; however, there is some suspicion that this figure is high for apartments in the 55 to 66 m² range. Use a 5 m² bracket, that is, [55, 60) (small) and [60, 65) (medium), to see if evidence exists that the average difference in price between the medium and small apartment sizes is less than €15,000.
(a) Use the data frame VIT2005 and the variables totalprice and area to test the appropriate hypothesis at a 5% significance level.
(b) Are the assumptions for using a t-test satisfied? Explain.
(c) Does the answer for (a) differ if the variances are assumed to be equal? Can the hypothesis of equal variances be rejected?

Solution:
(a) To solve this problem, start by creating a variable aptsize. Then, verify the reasonableness of the normality assumption for the two apartment sizes.

> ggplot(data = VITsub, aes(totalprice, fill = aptsize)) +
+   geom_density(alpha = 0.2) +
+   theme_bw()
> ggplot(data = VITsub, aes(sample = totalprice, color = aptsize)) +
+   stat_qq() +
+   theme_bw()
> c(xbar, ybar)
[1] 205083.3 191531.5

The value of this test statistic is 205083.3333 − 191531.5287 = 13551.8046. The standardized test statistic under the assumption that H0 is true and its approximate distribution are

$$\frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\frac{S_X^2}{n_X} + \frac{S_Y^2}{n_Y}}} \sim t_{\nu}.$$


Step 3: Rejection Region Calculations. Because the standardized test statistic is distributed approximately tν, and H1 is a lower one-sided hypothesis, the rejection region is tobs < t0.05; 16.0129 = −1.7458.

> TR

        Welch Two Sample t-test

data:  MediumAptPrice and SmallAptPrice
t = -0.16284, df = 16.013, p-value = 0.4363
alternative hypothesis: true difference in means is less than 15000
95 percent confidence interval:
     -Inf 29077.34
sample estimates:
mean of x mean of y
 205083.3  191531.5

> RR
[1] -1.745797

The degrees of freedom are

$$\nu = \frac{\left(\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}\right)^2}{\frac{(s_X^2/n_X)^2}{n_X - 1} + \frac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 16.0129,$$

and the value of the standardized test statistic is

$$t_{\text{obs}} = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{\frac{s_X^2}{n_X} + \frac{s_Y^2}{n_Y}}} = -0.1628.$$

Step 4: Statistical Conclusion. The ℘-value is P(t16.0129 ≤ −0.1628) = 0.4363.
I. From the rejection region, fail to reject H0 because tobs = −0.1628 is greater than −1.7458.
II. From the ℘-value, fail to reject H0 because the ℘-value = 0.4363 is greater than 0.05.
Fail to reject H0.
Step 5: English Conclusion. There is not sufficient evidence to suggest the average difference in totalprice between the medium and small apartment sizes is less than €15,000.
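The Welch degrees of freedom, the rejection boundary, and the ℘-value can each be checked separately. The sketch below uses a helper, `welch_df()`, that is a hypothetical name introduced here; because the sample variances are not printed in the solution, the helper is exercised on made-up inputs, while the boundary and ℘-value use the reported ν and tobs.

```r
# Welch-Satterthwaite degrees of freedom from sample variances and sizes
welch_df <- function(s2x, nx, s2y, ny) {
  vx <- s2x / nx
  vy <- s2y / ny
  (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
}

# Illustrative inputs only: equal variances and equal sizes give nx + ny - 2
welch_df(s2x = 1, nx = 10, s2y = 1, ny = 10)

# Quantities reported in the solution
nu   <- 16.0129
tobs <- -0.1628

RR     <- qt(0.05, df = nu)   # lower one-sided rejection boundary
pvalue <- pt(tobs, df = nu)   # lower-tail p-value
c(RR, pvalue)
```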

(b) From the description of the data, it is not clear if they were obtained as a random sample of all apartments in Vitoria; however, it is reasonable to assume the distribution of totalprice for both medium and small apartments follows a normal distribution.


(c)
> TR

        Two Sample t-test

data:  MediumAptPrice and SmallAptPrice
t = -0.1601, df = 18, p-value = 0.4373
alternative hypothesis: true difference in means is less than 15000
95 percent confidence interval:
     -Inf 29237.82
sample estimates:
mean of x mean of y
 205083.3  191531.5

The answer for (a) is the same if variances are assumed to be equal.

> var.test(MediumAptPrice, SmallAptPrice)

        F test to compare two variances

data:  MediumAptPrice and SmallAptPrice
F = 1.1756, num df = 11, denom df = 7, p-value = 0.8594
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2496275 4.4187049
sample estimates:
ratio of variances
          1.175613

The hypothesis of equal variance cannot be rejected at the α = 0.05 level.

21. A survey to determine unemployment demographics was administered during the first trimester of 2005 in the Spanish province of Navarra. The numbers of unemployed people according to urban and rural areas and gender follow.

Unemployment in Navarra, Spain, in 2005
           Male    Female    Totals
Urban      4734      6161     10895
Rural      3259      4033      7292
Totals     7993     10194     18187

(a) Test to see if there is evidence to suggest that πmale|urban < πfemale|urban at α = 0.05.
(b) Use an exact test to see if the evidence suggests πfemale|urban > 0.55.
(c) Is there evidence to suggest the unemployment rate for females given that they live in a rural area is greater than 50%? Use α = 0.05 with an exact test to reach a conclusion.


(d) Does evidence suggest that πfemale|urban > πfemale|rural?

Solution:
(a) To solve this problem, use the five-step procedure.
Step 1: Hypotheses. The null and alternative hypotheses to test if there is evidence to suggest that πmale|urban < πfemale|urban are H0 : πX = πY versus H1 : πX < πY. In this case, let the random variable X represent the number of males given an urban area, and let the random variable Y represent the number of females given an urban area.
Step 2: Test Statistic. The test statistic chosen is PX − PY since E[PX − PY] = πX − πY. The standardized test statistic under the assumption that H0 is true is

$$Z = \frac{P_X - P_Y}{\sqrt{P(1-P)\left(\frac{1}{m} + \frac{1}{n}\right)}},$$

where P = (X + Y)/(m + n) is the pooled sample proportion.
Step 3: Rejection Region Calculations. Because the standardized test statistic has an approximate N(0, 1) distribution and H1 is a lower one-sided hypothesis, the rejection region is zobs < z0.05 = −1.6449.

> RR
[1] -1.644854

(b)
> TR

        Exact binomial test

data:  6161 and 10895
number of successes = 6161, number of trials = 10895, p-value = 0.0005906
alternative hypothesis: true probability of success is greater than 0.55
95 percent confidence interval:
 0.5576192 1.0000000
sample estimates:
probability of success
             0.5654888

Since the ℘-value is 6e-04, reject H0. Reject H0.
Step 5: English Conclusion. There is sufficient evidence to suggest the proportion of unemployed females given an urban area is greater than 55%.

(c) Use the five-step procedure.
Step 1: Hypotheses. The null and alternative hypotheses to test whether the proportion of unemployed females given a rural area is greater than 50% are H0 : π = 0.50 versus H1 : π > 0.50.
Step 2: Test Statistic. The test statistic chosen is Y, where Y is the number of unemployed females given a rural area. Provided H0 is true, Y ∼ Bin(n, π0). The value of the test statistic is yobs = 4033.
Step 3: Rejection Region Calculations. Rejection is based on the ℘-value, so none are required.
Step 4: Statistical Conclusion. Likelihood Method:

> pvalue
[1] 6.48379e-20
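Both exact ℘-values for the female proportions follow directly from the counts in the Navarra table. A minimal sketch, with object names `pb` and `pc` chosen here for illustration:

```r
# Urban: 6161 females among 10895 unemployed; H1: pi > 0.55
pb <- binom.test(6161, 10895, p = 0.55, alternative = "greater")$p.value
pb

# Rural: 4033 females among 7292 unemployed; H1: pi > 0.50
# P(Y >= 4033 | H0) computed as an upper tail to avoid rounding to 0
pc <- pbinom(4033 - 1, size = 7292, prob = 0.50, lower.tail = FALSE)
pc
```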


> TR

        Exact binomial test

data:  4033 and 7292
number of successes = 4033, number of trials = 7292, p-value < 2.2e-16
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.5434122 1.0000000
sample estimates:
probability of success
             0.5530719

Since the ℘-value is essentially 0 (6.48e-20), reject H0. Reject H0.
Step 5: English Conclusion. There is sufficient evidence to suggest the proportion of unemployed females given a rural area is greater than 50%.

(d) To solve this problem, use the five-step procedure.
Step 1: Hypotheses. The null and alternative hypotheses to test if there is evidence to suggest that πfemale|urban > πfemale|rural are H0 : πX = πY versus H1 : πX > πY. In this case, let the random variable X represent the number of females given an urban area, and let the random variable Y represent the number of females given a rural area.
Step 2: Test Statistic. The test statistic chosen is PX − PY since E[PX − PY] = πX − πY. The standardized test statistic under the assumption that H0 is true is

$$Z = \frac{P_X - P_Y}{\sqrt{P(1-P)\left(\frac{1}{m} + \frac{1}{n}\right)}},$$

where P = (X + Y)/(m + n) is the pooled sample proportion.
Step 3: Rejection Region Calculations. Because the standardized test statistic has an approximate N(0, 1) distribution and H1 is an upper one-sided hypothesis, the rejection region is zobs > z0.95 = 1.6449.
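The pooled two-proportion statistic for this comparison can be computed directly from the counts in the Navarra table. The sketch below is an illustration under that formula, not a reproduction of the manual's own output.

```r
x <- 6161; m <- 10895   # unemployed females and total unemployed, urban
y <- 4033; n <- 7292    # unemployed females and total unemployed, rural

px <- x / m
py <- y / n
p  <- (x + y) / (m + n)   # pooled proportion under H0

zobs <- (px - py) / sqrt(p * (1 - p) * (1 / m + 1 / n))
zobs                              # roughly 1.65

qnorm(0.95)                       # rejection boundary for the upper-tailed test
pnorm(zobs, lower.tail = FALSE)   # upper-tail p-value
```

The statistic lands only just beyond the 1.6449 boundary, so the conclusion at α = 0.05 is sensitive to small changes in the counts.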


> RR
[1] -1.76131
> TR

        One Sample t-test

data:  Diff
t = -2.5096, df = 14, p-value = 0.0125
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
      -Inf -9.214292
sample estimates:
mean of x
-30.90333

The value of the standardized test statistic is

$$t_{\text{obs}} = \frac{\bar{d} - \delta_0}{s_D/\sqrt{n_D}} = \frac{-30.9033 - 0}{47.6925/\sqrt{15}} = -2.5096.$$

Step 4: Statistical Conclusion. The ℘-value is P(t14 ≤ −2.5096) = 0.0125.
I. From the rejection region, reject H0 because tobs = −2.5096 is less than −1.7613.
II. From the ℘-value, reject H0 because the ℘-value = 0.0125 is less than 0.05.
Reject H0.
Step 5: English Conclusion. There is evidence to suggest that the mean difference between prices for insurance quotes from company A and company B is less than zero. That is, for similar insurance, company A is less expensive than company B.
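The paired t statistic, the rejection boundary, and the ℘-value can be recovered from the summary values quoted in the solution. A minimal sketch follows; the variable names are chosen here for illustration.

```r
dbar   <- -30.9033   # mean of the paired differences
sD     <- 47.6925    # standard deviation of the differences
nD     <- 15         # number of pairs
delta0 <- 0          # difference under H0

tobs   <- (dbar - delta0) / (sD / sqrt(nD))
RR     <- qt(0.05, df = nD - 1)   # lower one-sided boundary, a negative number
pvalue <- pt(tobs, df = nD - 1)   # lower-tail p-value

c(tobs = tobs, RR = RR, pvalue = pvalue)
tobs < RR                         # TRUE means tobs falls in the rejection region
```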


The owner of the transportation fleet changed his mind when presented with quotes for the same jobs from each insurance company. In the original 100 jobs, the overall pattern of the companies' insuring could be seen, but a comparison of similar jobs was not clear. With the paired data, a more reasonable comparison could be made.

23. Environmental monitoring is done in many fashions, including tracking levels of different chemicals in the air, underground water, soil, fish, milk, and so on. It is believed that cows grazing in pastures where gamma radiation from iodine exceeds 0.3 µGy/h produce milk with iodine concentrations in excess of 3.7 MBq/m³. Assuming the distribution of iodine in pastures follows a normal distribution with a standard deviation of 0.015 µGy/h, determine the required sample size to detect a 2% increase in baseline gamma radiation (0.3 µGy/h) using an α = 0.05 significance level with probability 0.99 or more.

Solution: The null and alternative hypotheses to test whether gamma radiation from iodine exceeds 0.3 µGy/h are H0 : µ = 0.3 versus H1 : µ > 0.3. The power of a test is the probability that the null hypothesis is rejected when it is false. Here,

$$\text{Power}(\mu_1 = 0.3 \times 1.02) = P(\text{reject } H_0 \mid \mu_1 = 0.306) = P\left(\bar{X} > 95\text{th percentile of a } N\left(0.3, \tfrac{0.015}{\sqrt{n}}\right) \,\Big|\, \mu_1 = 0.306\right).$$

R is used in an iterative process to discover the value of n = 99, for which

$$\text{Power}(\mu_1 = 0.306) = P\left(\bar{X} > 95\text{th percentile of a } N\left(0.3, \tfrac{0.015}{\sqrt{99}}\right) \,\Big|\, \mu_1 = 0.306\right) = 0.9902.$$
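The iterative search for n can be sketched as follows. The helper `power_n()` is a hypothetical name introduced here and is not necessarily how the manual organized its code; the closed-form sample size formula at the end serves as a cross-check.

```r
mu0    <- 0.3          # baseline gamma radiation under H0
mu1    <- 0.3 * 1.02   # 2% increase to be detected
sigma  <- 0.015
alpha  <- 0.05
target <- 0.99         # required power

# Power of the upper-tailed z-test for a sample of size n
power_n <- function(n) {
  crit <- qnorm(1 - alpha, mean = mu0, sd = sigma / sqrt(n))  # 95th percentile under H0
  pnorm(crit, mean = mu1, sd = sigma / sqrt(n), lower.tail = FALSE)
}

# Increase n until the power requirement is met
n <- 1
while (power_n(n) < target) n <- n + 1
c(n = n, power = power_n(n))

# Cross-check with the usual closed-form sample size for a one-sided z-test
n_closed <- ceiling(((qnorm(1 - alpha) + qnorm(target)) * sigma / (mu1 - mu0))^2)
n_closed
```

With n = 98 the power falls just short of 0.99, which is why the search stops at 99.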


> pvalue
[1] 0.5


The statistic s = 6 and the ℘-value is 0.5. There is not sufficient evidence to suggest the median satisfaction score is greater than 7.

7. A Mendebaldea real estate agent claims Mendebaldea, Spain, has larger apartments than those in San Jorge, Spain. A San Jorge real estate agent disputes this claim. To resolve the issue, two random samples of the total area of several apartments (given in m²) are taken from each community in 2002 and stored in the data frame APTSIZE.

Mendebaldea    90   92   90   83   85   105   136
San Jorge      75   75   53   78   52    90    78   75

(a) Is there evidence to support the Mendebaldea agent's claim?
(i) Use an exact procedure.
(ii) Use an approximate procedure.
(b) Find a confidence interval for the median of Mendebaldea minus the median of San Jorge with a confidence level of at least 0.90.

Solution:
(a)
> ggplot(data = APTSIZE) +
+   geom_density(aes(x = size, fill = location), alpha = 0.3) +
+   theme_bw()

[Figure: density of size by location (Mendebaldea, SanJorge)]

Based on the densities, the distributional shapes and skews for the two apartments appear to be different; however, due to the small sample sizes (7 and 8), it is very hard to reject the null hypothesis that the two cdfs are the same. Consequently, one might assume the distributions are similar and proceed with a Wilcoxon rank-sum test procedure. An alternative approach might be to perform a permutation test.

(i) Exact procedure:
> library(coin)
Loading required package: survival
Attaching package: 'coin'
The following object is masked by '.GlobalEnv':
    alpha
> wilcox_test(size ~ location, data = APTSIZE, distribution = "exact",
+             alternative = "greater")

        Exact Wilcoxon Mann-Whitney Rank Sum Test

data:  size by location (Mendebaldea, SanJorge)
Z = 2.9167, p-value = 0.001088
alternative hypothesis: true mu is greater than 0

> obsDiff
Mendebaldea
   25.28571
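The permutation test mentioned as an alternative can be carried out exactly, since there are only choose(15, 7) = 6435 ways to assign seven of the fifteen apartments to Mendebaldea. The sketch below enumerates them with combn(); it is an illustration built from the data table in the exercise and is not necessarily the resampling code used in the manual.

```r
Mendebaldea <- c(90, 92, 90, 83, 85, 105, 136)
SanJorge    <- c(75, 75, 53, 78, 52, 90, 78, 75)
size        <- c(Mendebaldea, SanJorge)

obsDiff <- mean(Mendebaldea) - mean(SanJorge)
obsDiff

# The grand total is fixed, so the difference in means is at least as large as
# observed exactly when the Mendebaldea group sum is at least as large as observed
obsSum <- sum(Mendebaldea)
sums   <- combn(length(size), length(Mendebaldea),
                FUN = function(idx) sum(size[idx]))

pvalue <- mean(sums >= obsSum)   # one-sided exact permutation p-value
pvalue
```

Comparing integer group sums rather than floating-point means keeps the tie handling exact.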

> ggplot(data = USJudgeRatings, aes(x = (INTG - DMNR))) +
+   geom_density(fill = "pink") +
+   theme_bw()

[Figure: density of (INTG − DMNR)]

Since the density plot of INTG − DMNR is skewed to the right, use the sign test to test if lawyers are more likely to give a judge high integrity ratings rather than high demeanor ratings.

> SIGN.test(Dif, md = 0, alternative = "greater")

        One-sample Sign-Test

data:  Dif
s = 41, p-value = 4.552e-13
alternative hypothesis: true median is greater than 0
95 percent confidence interval:
 0.2563989       Inf
sample estimates:
median of x
        0.4

                  Conf.Level L.E.pt U.E.pt
Lower Achieved CI     0.9369 0.3000    Inf
Interpolated CI       0.9500 0.2564    Inf
Upper Achieved CI     0.9670 0.2000    Inf

Based on the small ℘-value, reject the null hypothesis. The evidence suggests lawyers are more likely to give a judge high integrity ratings rather than high demeanor ratings.

(b)


> SIGN.test(Dif, md = 0, conf.level = 0.90)

        One-sample Sign-Test

data:  Dif
s = 41, p-value = 9.104e-13
alternative hypothesis: true median is not equal to 0
90 percent confidence interval:
 0.2563989 0.4436011
sample estimates:
median of x
        0.4

                  Conf.Level L.E.pt U.E.pt
Lower Achieved CI     0.8737 0.3000 0.4000
Interpolated CI       0.9000 0.2564 0.4436
Upper Achieved CI     0.9340 0.2000 0.5000
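The sign test reduces to a binomial calculation on the signs of the differences, so its statistic and ℘-values can be reproduced without SIGN.test(). The sketch below uses the USJudgeRatings data frame that ships with R and assumes, as is usual for the sign test, that zero differences are dropped before counting.

```r
Dif <- with(USJudgeRatings, INTG - DMNR)

s <- sum(Dif > 0)    # number of positive differences (the sign statistic)
n <- sum(Dif != 0)   # zero differences are discarded

# One-sided p-value: P(S >= s) when S ~ Bin(n, 0.5)
p_greater <- pbinom(s - 1, size = n, prob = 0.5, lower.tail = FALSE)

# Two-sided p-value: twice the smaller tail, capped at 1
p_two <- min(1, 2 * min(p_greater, pbinom(s, size = n, prob = 0.5)))

c(s = s, n = n, median = median(Dif))
c(p_greater, p_two)
```

The ℘-values printed by SIGN.test() may differ from these in the last displayed digits, since a probability this small is sensitive to whether it is computed as an upper tail or by subtraction from 1.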

An interpolated 90% confidence interval for the median difference (INTG − DMNR) is [0.2564, 0.4436].

10. A company manager is studying the possibility of giving 20 minutes of rest to her employees in a resting room. To check the viability of this proposal, she analyzed 12 random days of productivity where employees took 20 minutes of rest and 12 random days where they did not. The groups' productivity scores are given in the following table where higher scores represent greater productivity.

With Rest       9   8   8   7   6   7   8   9   7   7   7   6
Without Rest    7   9   5   6   7   3   9   9   4   5   6   4

Is there evidence to suggest that taking a rest produces an increase in group productivity? Answer based on the results from a
(a) Wilcoxon signed-rank test,
(b) t-test, and a
(c) Permutation test.

Solution: The function eda() is used on the productivity differences.

> obsMdif
[1] 1.25
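Following the solution's use of productivity differences, parts (b) and (c) can be sketched from the table alone. The permutation test below is the exact sign-flip version, which enumerates all 2^12 = 4096 sign assignments; that is a substitution made here for illustration, since a simulation-based version would give a slightly different ℘-value. The names `norest` and `signs` are introduced for this sketch.

```r
rest   <- c(9, 8, 8, 7, 6, 7, 8, 9, 7, 7, 7, 6)
norest <- c(7, 9, 5, 6, 7, 3, 9, 9, 4, 5, 6, 4)
d      <- rest - norest

obsMdif <- mean(d)
obsMdif

# (b) one-sided t-test on the differences
tt <- t.test(d, mu = 0, alternative = "greater")
tt$statistic
tt$p.value

# (c) exact sign-flip permutation test: under H0 each difference is equally
# likely to be positive or negative
signs    <- as.matrix(expand.grid(rep(list(c(-1, 1)), length(d))))
permSums <- as.vector(signs %*% abs(d))
pvalue   <- mean(permSums >= sum(d))
pvalue
```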

> library(coin)
> oneway_test(Time ~ Company, distribution = "exact",
+             alternative = "less", data = DF)

        Exact 2-Sample Permutation Test

data:  Time by Company (American, Japanese)
Z = -1.7756, p-value = 0.04365
alternative hypothesis: true mu is less than 0

Note that the ℘-values for (a) and (b) agree.

(c)
> library(boot)
Attaching package: 'boot'
The following object is masked from 'package:survival':
    aml
The following object is masked from 'package:lattice':
    melanoma
> set.seed(12)

> wilcox.test(extra ~ group, paired = TRUE, data = sleep, correct = FALSE)
Warning in wilcox.test.default(x = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, :
cannot compute exact p-value with ties
Warning in wilcox.test.default(x = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, :
cannot compute exact p-value with zeroes

        Wilcoxon signed rank test

data:  extra by group
V = 0, p-value = 0.007632
alternative hypothesis: true location shift is not equal to 0

Based on the small ℘-value (0.0076), reject the null hypothesis. The evidence suggests the mean difference between the drugs is not zero.

(b)
> obsMdif
[1] -1.58
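The signed-rank result and the observed mean difference for the sleep data can be reproduced with two plain vectors. Passing the vectors directly sidesteps any dependence on how the formula interface handles `paired`, and `exact = FALSE` requests the normal approximation that the warnings above show was used anyway. This is a sketch, not the manual's own code.

```r
x <- sleep$extra[sleep$group == "1"]   # drug 1
y <- sleep$extra[sleep$group == "2"]   # drug 2

WT <- wilcox.test(x, y, paired = TRUE, exact = FALSE, correct = FALSE)
WT$statistic   # V
WT$p.value

obsMdif <- mean(x - y)   # observed mean difference used for the permutation test
obsMdif
```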

> ggplot(data = CIRCUIT, aes(sample = lifetime, shape = design,
+        color = design)) +
+   stat_qq() +
+   theme_bw()

[Figure: density plots of lifetime for Design1 through Design4, and normal quantile-quantile plots (sample versus theoretical quantiles) of lifetime by design]

The density plots and quantile-quantile normal plots make normality questionable; however, ruling out normality with so few observations is difficult.

(b)
> TR

        Kruskal-Wallis rank sum test

data:  lifetime by design
Kruskal-Wallis chi-squared = 10.245, df = 3, p-value = 0.0166

The ℘-value from the Kruskal-Wallis test of 0.0166 suggests differences exist among the mean lifetimes of different circuit designs.

(c)

> library(MASS)
> ggplot(data = airquality, aes(sample = Ozone, shape = Month,
+        color = Month)) +
+   stat_qq() +
+   theme_bw()

[Figure: normal quantile-quantile plots (sample versus theoretical quantiles) of Ozone by Month, for months 5 through 9]

The curvature in the Q-Q normal plots, in addition to the density plots and boxplots, suggests the distribution of ozone in each month is not normal.

(d)
> TR

        Kruskal-Wallis rank sum test

data:  Ozone by Month
Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06

The Kruskal-Wallis test ℘-value of 6.901e-06 suggests the mean ozone level is not the same for all the months; however, the assumption of similar shapes for the distribution of ozone for each month was questionable.

(e)

> TR

        Pearson's Chi-squared test

data:  FT
X-squared = 28.102, df = 2, p-value = 7.901e-07

The ℘-value of 7.901e-07 suggests there is an association between wool and tension.

18. The music industry wants to know if the musical style on a CD influences how many illegal copies of it are sold. To achieve this purpose, the company chooses six cities randomly and writes down the number of illegal CDs available on the street categorized by music type: classical music, flamenco, heavy metal, and pop-rock. The data are shown in the following table.

                         Musical Style
City      Classical   Flamenco   Heavy Metal   Pop-Rock
City 1        4           1           6            9
City 2        3           4           5           10
City 3        2           1           8           14
City 4        5           3           2            7
City 5        2           3           6           14
City 6        9           1           2            6

(a) Create boxplots and density plots of the number of illegal CDs available for each music style.
(b) Are the distribution shapes similar?
(c) Are there significant differences in the numbers of CDs available according to musical style?

Solution:
(a)
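For part (c), one reasonable analysis treats the six cities as blocks and compares the four styles with a Friedman test. That choice is an assumption made here for illustration; the data frame `CD` and its column names are likewise introduced for this sketch.

```r
# Counts from the table, entered style by style (City 1 through City 6)
number <- c(4, 3, 2, 5, 2, 9,      # Classical
            1, 4, 1, 3, 3, 1,      # Flamenco
            6, 5, 8, 2, 6, 2,      # Heavy Metal
            9, 10, 14, 7, 14, 6)   # Pop-Rock

CD <- data.frame(
  number = number,
  style  = factor(rep(c("Classical", "Flamenco", "HeavyMetal", "PopRock"),
                      each = 6)),
  city   = factor(rep(1:6, times = 4))
)

# Friedman rank sum test: styles are the treatments, cities are the blocks
FRT <- friedman.test(number ~ style | city, data = CD)
FRT$statistic
FRT$p.value
```

Within each city the four counts are ranked from 1 to 4, and the test asks whether the rank sums per style differ more than chance would allow.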

> TRaii

        Pearson's Chi-squared test

data:  TCSmales
X-squared = 33.033, df = 2, p-value = 6.714e-08

> TRaiii

        Pearson's Chi-squared test

data:  TCSfemales
X-squared = 115.7, df = 2, p-value < 2.2e-16

Evidence suggests there are associations between class and survival for all of the passengers grouped together (℘-value ≈ 0), men separately (℘-value = 6.714e-08), and women evaluated separately (℘-value < 2.2e-16).


27. Mental inpatients in the Virgen del Camino Hospital (Pamplona, Spain) are interviewed by expert psychiatrists to diagnose their illnesses. An important aspect in diagnosis is determining the severity of any delusions a patient might suffer. A new questioning technique has been developed to detect the presence of delusions. The technique assigns a score from 0 to 5, where 5 indicates the presence of strong delusions and 0 indicates no delusions. The psychiatrists wish to know if the new technique actually results in high scores for patients who have previously been diagnosed as suffering from severe delusions. The scores that follow were obtained from randomly selected patients who were known to suffer from delusions and those who were known not to suffer with delusions:

Delusions    Score
Present      5   5   4   5   4   5   5
Absent       1   0   5   0   4   4   0

Do the data provide evidence that the new test yields higher scores for those patients who are known to suffer from delusions than for those who do not suffer from delusions?

Solution: Due to the small sample sizes and the discrete nature of the scores, a permutation test is used to see if the new test yields higher scores for patients who are known to suffer from delusions than patients that do not suffer from delusions.
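With seven scores in each group there are only choose(14, 7) = 3432 possible assignments, so the permutation test can be carried out exactly rather than by simulation. The sketch below is one way to do that from the table; it is an illustration and may differ from the resampling approach used in the manual.

```r
present <- c(5, 5, 4, 5, 4, 5, 5)
absent  <- c(1, 0, 5, 0, 4, 4, 0)
Score   <- c(present, absent)

obsSum <- sum(present)   # equivalent to using the difference in means

# Group sum for every way of labelling 7 of the 14 patients as "present"
sums <- combn(length(Score), length(present),
              FUN = function(idx) sum(Score[idx]))

pvalue <- mean(sums >= obsSum)   # one-sided exact permutation p-value
pvalue
```

Only 40 of the 3432 assignments give a "present" total at least as large as the observed 33, so the ℘-value is about 0.012.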

> set.seed(10)
> library(boot)
> sd(boot100$t)
[1] 0.09386073
> PEboot100
[1] 6.139268
> PEboot1000
[1] 0.9157565

The percent difference from what the standard error should be decreases as the sample size increases.

30. The "Wisconsin Card Sorting Test" is widely used by psychiatrists, neurologists, and neuropsychologists with patients who have a brain injury, neurodegenerative disease, or a mental illness such as schizophrenia. Patients with any sort of frontal lobe lesion generally do poorly on the test. The data frame WCST and the following table contain the test scores from a group of 50 patients from the Virgen del Camino Hospital (Pamplona, Spain).

23  12  31   8   7  28  25  17  19  42
17   6  19  11  36  94   6  10  22   8
20  47   5  13  28  19   8   6  11  10
19  65  13   7  18  26  35  78  11   7
19  38   8  15  40  17   5  26  15   4

(a) Use the function eda() from the PASWR2 package to explore the data and decide if normality can be assumed.
(b) What assumption(s) must be made to compute a 95% confidence interval for the population mean?
(c) Compute the confidence interval from (b).
(d) Compute a 95% BCa bootstrap confidence interval for the mean test score.
(e) Should you use the confidence interval reported in (c) or the confidence interval reported in (d)?

Solution:

(a) Assuming the variable score has a normal distribution is not reasonable.

Probability and Statistics with R, Second Edition: Exercises and Solutions

> with(data = WCST,
+      eda(score))
Size (n)  Missing  Minimum   1st Qu     Mean   Median   TrMean   3rd Qu
  50.000    0.000    4.000    8.500   21.480   17.000   19.413   26.000
     Max    Stdev      Var  SE Mean   I.Q.R.    Range Kurtosis Skewness
  94.000   18.406  338.785    2.603   17.500   90.000    4.511    2.033
SW p-val
   0.000

[Figure: EXPLORATORY DATA ANALYSIS panels for score: histogram, density plot, boxplot, and Q-Q plot.]

(b) In order to construct a 95% confidence interval for the population mean, one assumes that the values in the variable score are taken from a normal distribution. Although this is not a reasonable assumption, the sample size might be sufficiently large to overcome the skewness in the parent population. Consequently, one might appeal to the Central Limit Theorem and claim that the sampling distribution of $\bar{X}$ is approximately normal due to the sample size (50). In this problem, the skewness is quite severe, and one should not be overly confident in the final interval.

(c)

> CI
[1] 16.24904 26.71096
attr(,"conf.level")
[1] 0.95

The confidence interval is [16.249, 26.711].

(d)

library(boot) # use boot package MEAN + >
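The intervals asked for in parts (c) and (d) can be sketched from the 50 scores listed in the exercise. This is a hedged illustration rather than the original solution code: the helper mean_stat, the seed, and the number of resamples (R = 9999) are arbitrary choices made for the example, and the data are typed in from the table instead of being taken from the WCST data frame.

```r
library(boot)  # recommended package distributed with R

# The 50 scores listed in the exercise statement
score <- c(23, 12, 31,  8,  7, 28, 25, 17, 19, 42,
           17,  6, 19, 11, 36, 94,  6, 10, 22,  8,
           20, 47,  5, 13, 28, 19,  8,  6, 11, 10,
           19, 65, 13,  7, 18, 26, 35, 78, 11,  7,
           19, 38,  8, 15, 40, 17,  5, 26, 15,  4)

# Part (c): t-based 95% confidence interval for the mean
t_ci <- t.test(score, conf.level = 0.95)$conf.int

# Part (d): BCa bootstrap interval for the mean.
# The statistic function takes the data and a vector of resampled indices.
mean_stat <- function(data, i) mean(data[i])
set.seed(13)  # arbitrary seed, used only to make the sketch reproducible
boot_out <- boot(data = score, statistic = mean_stat, R = 9999)
bca_ci <- boot.ci(boot_out, conf = 0.95, type = "bca")$bca[4:5]

t_ci
bca_ci
```

The t-based interval reproduces [16.249, 26.711]. Because the scores are strongly right skewed, the BCa interval does not need to be symmetric about the sample mean, which is one reason to prefer it in part (e).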


set.seed(12) sims


select = score, drop = TRUE) for(i in 1:sims){ SC1 >


2.775

library(boot) MDS >

set.seed(1) block > >

set.seed(1) factor1 > > + > > >

Experimental Design


dof > > + > > >

> resid_matrix
              [,1]          [,2]  [,3]          [,4]
[1,]  3.403931e-15 -1.000000e+00  1.00 -1.735490e-15
[2,] -1.250000e+00 -2.500000e-01  0.75  7.500000e-01
[3,]  1.000000e+00  1.241337e-16 -1.00  1.241337e-16
> Check
     [,1] [,2] [,3] [,4]
[1,]    3    2    4    3
[2,]    2    3    4    4
[3,]    7    6    5    6

$$
\begin{bmatrix} 3 & 2 & 4 & 3 \\ 2 & 3 & 4 & 4 \\ 7 & 6 & 5 & 6 \end{bmatrix}
=
\begin{bmatrix} 4.08 & 4.08 & 4.08 & 4.08 \\ 4.08 & 4.08 & 4.08 & 4.08 \\ 4.08 & 4.08 & 4.08 & 4.08 \end{bmatrix}
+
\begin{bmatrix} -1.08 & -1.08 & -1.08 & -1.08 \\ -0.83 & -0.83 & -0.83 & -0.83 \\ 1.92 & 1.92 & 1.92 & 1.92 \end{bmatrix}
+
\begin{bmatrix} 0 & -1 & 1 & 0 \\ -1.25 & -0.25 & 0.75 & 0.75 \\ 1 & 0 & -1 & 0 \end{bmatrix}.
$$

The estimate of the error variance is the MSE = 0.75.

(d)

> checking.plots(model.aov)

[Figure: output of checking.plots(model.aov) with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 12, Bandwidth = 0.5719).]

Yes, the assumptions are satisfied for the model in part (a). Specifically, no discernible pattern is seen in the top left graph that would threaten the assumption of independence. The top and bottom right graphs suggest the assumption of normality for the errors is a reasonable assumption. The bottom left graph makes the assumption of constant variance appear reasonable.

(e)

> barley.mc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ barley, data = DF)

$barley
    diff       lwr      upr     p adj
B-A 0.25 -1.459747 1.959747 0.9130865
C-A 3.00  1.290253 4.709747 0.0021919
C-B 2.75  1.040253 4.459747 0.0038641

> plot(barley.mc)

[Figure: 95% family-wise confidence intervals for C−B, C−A, and B−A; x-axis: Differences in mean levels of barley.]

C (SULTANE) is significantly higher than both A (ASPEN) and B (ERIKA), but A (ASPEN) is not significantly different from B (ERIKA).

(f)

> CO
  C1 C2
A -1 -1
B  1 -1
C  0  2
> TR
                 Df Sum Sq Mean Sq F value   Pr(>F)
C(barley, CO, 1)  1  0.125   0.125   0.167 0.692633
C(barley, CO, 2)  1 22.042  22.042  29.389 0.000421 ***
Residuals         9  6.750   0.750
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ℘-value (4e-04), evidence suggests there is a difference between SULTANE and the other two varieties ASPEN and ERIKA.

7. As described in Basic Statistics and Data Analysis, Car and Driver (July 1995) conducted tests of five cars from five different countries: Japan’s Acura NSXT, Italy’s Ferrari F355, Great Britain’s Lotus Esprit S4S, Germany’s Porsche 911 Turbo, and the United States’ Dodge Viper RT/10. The maximum speeds the cars obtained in miles per hour using as much distance as necessary without exceeding the engine’s redline are given:

Acura     159.7  161.5  163.7  166.0  157.7  161.7
Ferrari   179.6  173.9  180.2  183.9  176.7  178.4
Lotus     167.4  163.0  160.3  164.9  160.5  158.3
Porsche   173.5  182.4  171.3  175.7  179.1  175.0
Viper     172.3  168.9  169.5  174.6  161.1  164.2

Data from Kitchens (2003, page 512).

(a) What statistical model should be used to analyze this experiment?
(b) Conduct an analysis of variance to investigate if differences exist among the maximum speeds of the cars.
(c) Use appropriate diagnostic measures to check the adequacy of the model from part (a).
(d) What is the mean squared error value for the model from part (a)?
(e) Use Tukey’s multiple comparison test to determine which of the cars are different according to speed. Plot the confidence intervals for the mean differences.

Solution:

(a) A completely randomized design such as

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, 5, \quad j = 1, \ldots, 6, \quad \varepsilon_{ij} \sim N(0, \sigma)$$

should be used to analyze the experiment. Before proceeding with formal inferential procedures, the data are examined with the function oneway.plots().

> oneway.plots(Y = speed, fac1 = car)

[Figure: output of oneway.plots() showing speed by car (Acura, Ferrari, Lotus, Porsche, Viper), including a plot of the mean of Y for each level of the main factor car.]

Based on the output from oneway.plots(), one can see the fastest speeds have been recorded by Ferrari and Porsche, while the slowest speeds have been recorded by Acura and Lotus.

(b)

> car.mc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = speed ~ car, data = DF)

$car
                       diff         lwr       upr     p adj
Ferrari-Acura    17.0666667  10.6142478 23.519086 0.0000004
Lotus-Acura       0.6833333  -5.7690855  7.135752 0.9978264
Porsche-Acura    14.4500000   7.9975811 20.902419 0.0000064
Viper-Acura       6.7166667   0.2642478 13.169086 0.0384079
Lotus-Ferrari   -16.3833333 -22.8357522 -9.930914 0.0000008
Porsche-Ferrari  -2.6166667  -9.0690855  3.835752 0.7563408
Viper-Ferrari   -10.3500000 -16.8024189 -3.897581 0.0006917
Porsche-Lotus    13.7666667   7.3142478 20.219086 0.0000137
Viper-Lotus       6.0333333  -0.4190855 12.485752 0.0749333
Viper-Porsche    -7.7333333 -14.1857522 -1.280914 0.0132379
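The analysis of variance, the mean squared error, and the Tukey comparisons for this exercise can be sketched from the tabulated speeds. This is a hedged reconstruction and not necessarily the code of the original solution: the data vectors are typed in from the table, and the object names car.aov and MSE are assumptions (DF and car.mc follow the names visible in the output above).

```r
# Maximum speeds (mph), six runs per car, read from the table
speed <- c(159.7, 161.5, 163.7, 166.0, 157.7, 161.7,   # Acura
           179.6, 173.9, 180.2, 183.9, 176.7, 178.4,   # Ferrari
           167.4, 163.0, 160.3, 164.9, 160.5, 158.3,   # Lotus
           173.5, 182.4, 171.3, 175.7, 179.1, 175.0,   # Porsche
           172.3, 168.9, 169.5, 174.6, 161.1, 164.2)   # Viper
car <- factor(rep(c("Acura", "Ferrari", "Lotus", "Porsche", "Viper"),
                  each = 6))
DF <- data.frame(speed, car)

# One-way ANOVA for the completely randomized design
car.aov <- aov(speed ~ car, data = DF)
summary(car.aov)

# Mean squared error: residual sum of squares over residual df (25)
MSE <- deviance(car.aov) / df.residual(car.aov)
MSE

# Tukey HSD pairwise comparisons and a plot of the intervals
car.mc <- TukeyHSD(car.aov)
car.mc
plot(car.mc)
```

The pairwise differences and interval limits produced by this sketch agree with the Tukey output shown above, for example 17.07 mph for Ferrari minus Acura.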

> TR

	Paired t-test

data:  yield by site
t = -11.159, df = 9, p-value = 1.426e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28.92944 -19.17723
sample estimates:
mean of the differences
              -24.05333

Based on the ℘-value (1.426e-06, essentially 0), reject H0 and conclude that the mean of the differences is not equal to zero.

(b) In this setting, variety is a block.

> TR
            Df Sum Sq Mean Sq F value   Pr(>F)
site         1 2892.8  2892.8 124.523 1.43e-06 ***
variety      9  313.5    34.8   1.499    0.278
Residuals    9  209.1    23.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the small ℘-value (1.43e-06, essentially 0), reject H0 and conclude that the mean of the differences is not equal to zero.

(c) When a randomized complete block design (RCBD) with two treatments uses as many blocks as there are pairs in a two-sided paired t-test, the ℘-value for both procedures will be the same. Here the F statistic is the square of the t statistic, 124.523 = (−11.159)², and both ℘-values equal 1.43e-06.

9. An insurance company wants to know how its resources are being used with respect to time spent issuing travel insurance policies. The company randomly selects three moments during a day and records the time required to issue a travel insurance policy to three randomly selected clients who take out a travel policy over the phone, over the Internet, and in person. The data obtained (in minutes) are

telephone   3.49  2.38  2.09
Internet    4.38  6.68  5.37
in person   7.91  8.70  8.54

(a) What type of design structure did the company use?

(b) Propose a statistical model to analyze the data.

(c) Comment on any assumptions that need to be made with the model selected in part (b). Check these assumptions.

(d) Test to see if differences exist among the methods used to issue insurance policies.

(e) Estimate the model’s parameters.

(f) How is the standard deviation of the errors estimated?

(g) Write the estimated model in matrix form.

(h) Do the residuals sum to zero?

(i) Use Tukey’s HSD to determine if significant differences exist among methods.

(j) Create a barplot of the mean times, and display the standard errors over their respective means.

Solution:

(a) The design structure is a completely randomized design (CRD).

(b) $$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, 3, \quad j = 1, 2, 3, \quad \varepsilon_{ij} \sim N(0, \sigma)$$

(c) The three basic assumptions concerning the errors (independence, normal distribution, and constant variance) are assessed with the checking.plots() function.

> MSE
[1] 0.6838333
> sde
[1] 0.8269422

An estimate of the standard deviation of the errors is the square root of the mean squared error (0.8269).

(g)

> EFF
  (Intercept)   treatment  Residuals
1    5.504444 -2.85111111  0.8366667
2    5.504444 -2.85111111 -0.2733333
3    5.504444 -2.85111111 -0.5633333
4    5.504444 -0.02777778 -1.0966667
5    5.504444 -0.02777778  1.2033333
6    5.504444 -0.02777778 -0.1066667
7    5.504444  2.87888889 -0.4733333
8    5.504444  2.87888889  0.3166667
9    5.504444  2.87888889  0.1566667
> MeanMat
         [,1]     [,2]     [,3]
[1,] 5.504444 5.504444 5.504444
[2,] 5.504444 5.504444 5.504444
[3,] 5.504444 5.504444 5.504444
> TreatMat
            [,1]        [,2]        [,3]
[1,] -2.85111111 -2.85111111 -2.85111111
[2,] -0.02777778 -0.02777778 -0.02777778
[3,]  2.87888889  2.87888889  2.87888889
> ResidMat
           [,1]       [,2]       [,3]
[1,]  0.8366667 -0.2733333 -0.5633333
[2,] -1.0966667  1.2033333 -0.1066667
[3,] -0.4733333  0.3166667  0.1566667
> Values
     [,1] [,2] [,3]
[1,] 3.49 2.38 2.09
[2,] 4.38 6.68 5.37
[3,] 7.91 8.70 8.54

(h) The residuals sum to zero (up to floating-point rounding).

> sum(resid(insurance.aov))
[1] -1.387779e-16
> # Or
> sum(ResidMat)
[1] -1.387779e-16

(i)

> TukeyHSD(insurance.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = time ~ treatment, data = DF)

$treatment
                         diff       lwr        upr     p adj
internet-in person  -2.906667 -4.978352 -0.8349817 0.0119941
telephone-in person -5.730000 -7.801685 -3.6583151 0.0003595
telephone-internet  -2.823333 -4.895018 -0.7516484 0.0137052

The three methods of issuing insurance are all significantly different.

(j)

> library(plyr)
> mdf
  treatment MeanTreat        SE
1 in person  8.383333 0.2411316
2  internet  5.476667 0.6660914
3 telephone  2.653333 0.4266276
> ggplot(data = mdf, aes(x = treatment, y = MeanTreat, fill = treatment)) +
+   geom_bar(stat = "identity") +
+   geom_errorbar(aes(ymin = MeanTreat - SE, ymax = MeanTreat + SE),
+                 width = 0.25) +
+   guides(fill = "none") +
+   labs(x = "", y = "Mean Time to Issue Policy (in minutes)",
+        title = "Mean Time to Issue Policy \n with Individual Standard Errors") +
+   theme_bw()

[Figure: bar chart titled "Mean Time to Issue Policy with Individual Standard Errors"; y-axis: Mean Time to Issue Policy (in minutes); bars with error bars for in person, internet, and telephone.]
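The quantities reported in this solution (MSE, the error standard deviation, the residual sum, the Tukey differences, and the per-treatment means with standard errors) can be recomputed from the nine times in the exercise. The sketch below uses base R only and is an assumption-laden illustration: the data are typed in from the table, and the names means, SE, and tk are not from the original solution, which used plyr and ggplot2 for part (j).

```r
# Times (minutes) read from the table; three clients per method
time <- c(3.49, 2.38, 2.09,   # telephone
          4.38, 6.68, 5.37,   # internet
          7.91, 8.70, 8.54)   # in person
treatment <- factor(rep(c("telephone", "internet", "in person"), each = 3),
                    levels = c("in person", "internet", "telephone"))
DF <- data.frame(time, treatment)

insurance.aov <- aov(time ~ treatment, data = DF)

# (f) error variance and standard deviation estimates
MSE <- deviance(insurance.aov) / df.residual(insurance.aov)
sde <- sqrt(MSE)

# (h) residuals sum to zero up to floating-point rounding
sum(resid(insurance.aov))

# (i) Tukey HSD comparisons
tk <- TukeyHSD(insurance.aov)

# (j) treatment means and individual standard errors, sd / sqrt(n)
means <- tapply(DF$time, DF$treatment, mean)
SE    <- tapply(DF$time, DF$treatment, function(x) sd(x) / sqrt(length(x)))
cbind(MeanTreat = means, SE = SE)
```

The standard errors here are computed separately within each treatment, which matches the SE column printed for mdf; a pooled standard error based on MSE would be a different, equally defensible choice.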

10. A health-conscious pizza parlor is attempting to specify the added calories for each ingredient of its medium size pizza. Specifically, the pizza parlor wants to know if there is more variability in an olive topping due to olive suppliers or due to the olives themselves. From numerous suppliers, four are selected randomly and the calories for a pizza topping of olives are recorded for five randomly selected pizzas. The data obtained are given in the following table:

Supplier 1   133  136  142  135  134
Supplier 2   124  137  125  132  131
Supplier 3   127  126  130  120  123
Supplier 4   150  141  155  150  157

(a) Specify a statistical model to analyze these data.
(b) Conduct an ANOVA.
(c) Estimate the variance components and the total variability of the data.
(d) Interpret the results.

Solution:

(a) The model is

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad \tau_i \sim N(0, \sigma_\alpha), \quad \varepsilon_{ij} \sim N(0, \sigma)$$

(b)
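The ANOVA and the variance components for the olive data can be sketched from the table in the exercise. This is a minimal illustration under stated assumptions, not the original solution code: it uses the usual method-of-moments estimators for a balanced one-way random effects model (error variance estimated by MSE, supplier variance by (MST − MSE)/n), and the object names are chosen for the example.

```r
# Calories read from the table: five pizzas for each of four suppliers
calories <- c(133, 136, 142, 135, 134,   # Supplier 1
              124, 137, 125, 132, 131,   # Supplier 2
              127, 126, 130, 120, 123,   # Supplier 3
              150, 141, 155, 150, 157)   # Supplier 4
supplier <- factor(rep(1:4, each = 5))

fit <- aov(calories ~ supplier)
tab <- summary(fit)[[1]]      # ANOVA table as a data frame
tab

MST <- tab[["Mean Sq"]][1]    # mean square for suppliers
MSE <- tab[["Mean Sq"]][2]    # mean square error
n   <- 5                      # pizzas per supplier

# Method-of-moments variance component estimates
sigma2_hat     <- MSE               # variability of olives within a supplier
sigma2_tau_hat <- (MST - MSE) / n   # variability among suppliers
total_var_hat  <- sigma2_hat + sigma2_tau_hat

c(sigma2_hat, sigma2_tau_hat, total_var_hat)
```

Under these assumptions the estimates are 23.55 for the within-supplier component and about 117.56 for the supplier component, so most of the estimated total variability (about 141.11) would be attributed to differences among suppliers.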


pulpbright > > >

MST > > > > > > >

$$L = \frac{1}{n}\left(\frac{MST}{f_{1-\alpha/2;\,a-1,\,N-a}\cdot MSE} - 1\right) \quad \text{and} \quad U = \frac{1}{n}\left(\frac{MST}{f_{\alpha/2;\,a-1,\,N-a}\cdot MSE} - 1\right)$$

> model.tables(appliance.aov, type = "effects")
Tables of effects

 cycle
cycle
Prewash   Short  Medium    Long
 -5.888  -3.037   2.993   5.933

 machine
machine
Machine 1 Machine 2 Machine 3 Machine 4 Machine 5
   -0.435   -11.160     5.490     9.615    -3.510
> EFF
   (Intercept)   cycle machine Residuals
1      21.3975 -5.8875  -0.435     0.375
2      21.3975 -3.0375  -0.435     2.025
3      21.3975  2.9925  -0.435    -0.855
4      21.3975  5.9325  -0.435    -1.545
5      21.3975 -5.8875 -11.160    -1.200
6      21.3975 -3.0375 -11.160    -0.900
7      21.3975  2.9925 -11.160     0.570
8      21.3975  5.9325 -11.160     1.530
9      21.3975 -5.8875   5.490    -0.900
10     21.3975 -3.0375   5.490    -1.800
11     21.3975  2.9925   5.490     2.220
12     21.3975  5.9325   5.490     0.480
13     21.3975 -5.8875   9.615     0.075
14     21.3975 -3.0375   9.615    -0.825
15     21.3975  2.9925   9.615    -0.855
16     21.3975  5.9325   9.615     1.605
17     21.3975 -5.8875  -3.510     1.650
18     21.3975 -3.0375  -3.510     1.500
19     21.3975  2.9925  -3.510    -1.080
20     21.3975  5.9325  -3.510    -2.070
> GMmat
        [,1]    [,2]    [,3]    [,4]
[1,] 21.3975 21.3975 21.3975 21.3975
[2,] 21.3975 21.3975 21.3975 21.3975
[3,] 21.3975 21.3975 21.3975 21.3975
[4,] 21.3975 21.3975 21.3975 21.3975
[5,] 21.3975 21.3975 21.3975 21.3975
> CYCLEmat
        [,1]    [,2]   [,3]   [,4]
[1,] -5.8875 -3.0375 2.9925 5.9325
[2,] -5.8875 -3.0375 2.9925 5.9325
[3,] -5.8875 -3.0375 2.9925 5.9325
[4,] -5.8875 -3.0375 2.9925 5.9325
[5,] -5.8875 -3.0375 2.9925 5.9325
> MACHINEmat
        [,1]    [,2]    [,3]    [,4]
[1,]  -0.435  -0.435  -0.435  -0.435
[2,] -11.160 -11.160 -11.160 -11.160
[3,]   5.490   5.490   5.490   5.490
[4,]   9.615   9.615   9.615   9.615
[5,]  -3.510  -3.510  -3.510  -3.510
> RESIDUALmat
       [,1]   [,2]   [,3]   [,4]
[1,]  0.375  2.025 -0.855 -1.545
[2,] -1.200 -0.900  0.570  1.530
[3,] -0.900 -1.800  2.220  0.480
[4,]  0.075 -0.825 -0.855  1.605
[5,]  1.650  1.500 -1.080 -2.070
> VALUES
      [,1]  [,2]  [,3]  [,4]
[1,] 15.45 19.95 23.10 25.35
[2,]  3.15  6.30 13.80 17.70
[3,] 20.10 22.05 32.10 33.30
[4,] 25.20 27.15 33.15 38.55
[5,] 13.65 16.35 19.80 21.75
> matrix(apply(EFF, 1, sum), byrow = TRUE, nrow = 5)
      [,1]  [,2]  [,3]  [,4]
[1,] 15.45 19.95 23.10 25.35
[2,]  3.15  6.30 13.80 17.70
[3,] 20.10 22.05 32.10 33.30
[4,] 25.20 27.15 33.15 38.55
[5,] 13.65 16.35 19.80 21.75
> xtabs(time ~ machine + cycle, data = DF)
           cycle
machine     Prewash Short Medium  Long
  Machine 1   15.45 19.95  23.10 25.35
  Machine 2    3.15  6.30  13.80 17.70
  Machine 3   20.10 22.05  32.10 33.30
  Machine 4   25.20 27.15  33.15 38.55
  Machine 5   13.65 16.35  19.80 21.75
> apply(EFF[,2:4], 2, sum)  # Check that constraints sum to zero
        cycle       machine     Residuals
-4.884981e-15  4.440892e-15  2.220446e-16

The constraints for this model are satisfied since the sums of the estimated parameters for τi, βj, and εi,j are all zero (up to floating-point rounding). Note that the object EFF has estimates of µ, τi, βj, and εi,j under the headings (Intercept), cycle, machine, and Residuals, respectively.

(i)

> TukeyHSD(appliance.aov, which = "machine")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = time ~ cycle + machine, data = DF)

$machine
                       diff         lwr        upr     p adj
Machine 2-Machine 1 -10.725 -14.6234199 -6.8265801 0.0000119
Machine 3-Machine 1   5.925   2.0265801  9.8234199 0.0030033
Machine 4-Machine 1  10.050   6.1515801 13.9484199 0.0000232
Machine 5-Machine 1  -3.075  -6.9734199  0.8234199 0.1514411
Machine 3-Machine 2  16.650  12.7515801 20.5484199 0.0000001
Machine 4-Machine 2  20.775  16.8765801 24.6734199 0.0000000
Machine 5-Machine 2   7.650   3.7515801 11.5484199 0.0003323
Machine 4-Machine 3   4.125   0.2265801  8.0234199 0.0364514
Machine 5-Machine 3  -9.000 -12.8984199 -5.1015801 0.0000704
Machine 5-Machine 4 -13.125 -17.0234199 -9.2265801 0.0000014

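At the 95% family-wise confidence level, all machines are significantly different from one another with the exception of Machine 5 and Machine 1 (adjusted ℘-value 0.1514). The difference between Machine 4 and Machine 3 is the weakest of the significant differences (adjusted ℘-value 0.0365, with an interval from 0.227 to 8.023 that only narrowly excludes zero).

(j) Recall that there are (a − 1) orthogonal contrasts for a treatments. In this case, since there are 5 washing machines, there are 4 orthogonal contrasts.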

> contrasts(DF$machine)[ , 1]
> contrasts(DF$machine)[ , 2]
> contrasts(DF$machine)[ , 3]
> contrasts(DF$machine)[ , 4]
> TR
                  Df Sum Sq Mean Sq F value   Pr(>F)
C(machine, CO, 1)  1  863.2   863.2 288.527 9.30e-10 ***
C(machine, CO, 2)  1  104.6   104.6  34.957 7.11e-05 ***
C(machine, CO, 3)  1   69.8    69.8  23.345 0.000411 ***
C(machine, CO, 4)  1    0.9     0.9   0.316 0.584226
cycle              3  440.2   146.7  49.045 5.21e-07 ***
Residuals         12   35.9     3.0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note that Contrast 3 has a ℘-value of 4e-04, which suggests that the mean washing time of machines 2, 3, and 4 is significantly different from the mean washing time of machine 5.

(k)

> library(plyr) # load package
> mdf
    machine MeanMachine
1 Machine 1     20.9625
2 Machine 2     10.2375
3 Machine 3     26.8875
4 Machine 4     31.0125
5 Machine 5     17.8875
> MSE
[1] 2.99175
> ME
[1] 1.884311
> ggplot(data = mdf, aes(x = machine, y = MeanMachine, fill = machine)) +
+   geom_bar(stat = "identity") +
+   geom_errorbar(aes(ymin = MeanMachine - ME, ymax = MeanMachine + ME),
+                 width = 0.25) +
+   guides(fill = "none") +
+   labs(x = "", y = "Mean Wash Time (in minutes)",
+        title = "Mean Washing Time by Machine \n with Individual 95% CIs") +
+   theme_bw()

[Figure: bar chart titled "Mean Washing Time by Machine with Individual 95% CIs"; y-axis: Mean Wash Time (in minutes); bars with error bars for Machine 1 through Machine 5.]
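The decomposition of the washing times, the mean squared error, and the margin of error used for the error bars can be recomputed from the 5 × 4 table of times. The sketch below is an illustration in base R under stated assumptions: it treats the layout as a randomized complete block design with one observation per machine and cycle, and the names V, machine_eff, cycle_eff, fit_mat, res_mat, and df_error are chosen for the example and do not come from the original solution.

```r
# Washing times: rows are machines, columns are cycles
V <- matrix(c(15.45, 19.95, 23.10, 25.35,
               3.15,  6.30, 13.80, 17.70,
              20.10, 22.05, 32.10, 33.30,
              25.20, 27.15, 33.15, 38.55,
              13.65, 16.35, 19.80, 21.75),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste("Machine", 1:5),
                            c("Prewash", "Short", "Medium", "Long")))

grand       <- mean(V)               # estimate of mu
machine_eff <- rowMeans(V) - grand   # machine effects
cycle_eff   <- colMeans(V) - grand   # cycle effects

# Additive fit and residuals
fit_mat <- grand + outer(machine_eff, cycle_eff, "+")
res_mat <- V - fit_mat

# Error degrees of freedom (5 - 1)(4 - 1) = 12 and mean squared error
df_error <- (nrow(V) - 1) * (ncol(V) - 1)
MSE <- sum(res_mat^2) / df_error

# Margin of error for an individual 95% CI on a machine mean (4 cycles each)
ME <- qt(0.975, df_error) * sqrt(MSE / ncol(V))

c(MSE = MSE, ME = ME)
```

The effects and residuals from this sketch match the cycle, machine, and Residuals columns of EFF, and the last line reproduces MSE = 2.99175 and ME = 1.884311.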

13. The Environmental Protection Agency (EPA) is interested in the fuel consumption of older vehicles. An experiment is designed where the gallons of gasoline consumed by vehicles over six years old are measured when the same driver travels 162.78 miles from Boone, North Carolina, to Durham, North Carolina, in 35 different vehicles. Seven vehicles are randomly selected from each category to be tested. The categories are compact, station wagon, minivan, van, and full-size pickup truck. The data obtained (gallons consumed) are given in the following table:

Compact          4.35   4.96   4.82   4.62   4.32   4.70   4.82
Station Wagon    5.47   6.35   5.33   6.25   5.44   5.73   5.64
Minivan          9.37   7.43   8.40   6.76   8.62   7.53   7.54
Van              8.61   8.66  10.12   8.06   9.31   6.75   8.14
Pickup Truck    20.09  14.93  13.38  16.53  13.79  12.44  14.73

(a) Based on the described randomization, what type of design structure did the EPA use?
(b) Propose a statistical model to analyze these data.
(c) Are the assumptions for the model specified in part (b) satisfied? If the assumptions for the model specified in part (b) are not satisfied, suggest fixes before advancing to the next question.
(d) Are there significant differences between the fuel consumption for the five types of vehicles?
(e) Estimate the model’s error variance.
(f) What conclusions can be drawn from the data?

Solution:

(a) The design structure used by the EPA is a completely randomized design (CRD).

(b) The model to use is

$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, 3, 4, 5, \quad j = 1, \ldots, 7, \quad \varepsilon_{ij} \sim N(0, \sigma).$$

(c) The three basic assumptions concerning the errors (independence, normal distribution, and constant variance) are assessed with the checking.plots() function.

fuel TR TR Df Sum Sq Mean Sq F value Pr(>F) vehicle 4 2362.5 590.6 138.1 + > > + > > + +


> ggplot(data = pines, aes(x = shape, y = resingrams,
+        colour = acidtreatment, group = acidtreatment,
+        linetype = acidtreatment)) +
+   stat_summary(fun = mean, geom = "point") +
+   stat_summary(fun = mean, geom = "line") +
+   theme_bw() +
+   labs(x = "", y = "Resin in grams")

[Figure: two interaction plots of mean resin in grams. One plots acid treatment (Acid, Control) on the x-axis with one line per shape (Check, Circle, Diagonal, Rectangle); the other plots shape on the x-axis with one line per acid treatment.]

The parallel lines in both plots corroborate the assumption of no interaction between the factors acid treatment and hole shape.

(d) The three basic assumptions concerning the errors (independence, normal distribution, and constant variance) are assessed with the checking.plots() function.

> checking.plots(pines.aov)

[Figure: output of checking.plots(pines.aov) with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 24, Bandwidth = 0.4869).]

The assumption of constant variance is a little questionable, but the other assumptions appear to be satisfied.

(e)

> EFF
Tables of effects

 acidtreatment
acidtreatment
   Acid Control
  7.375  -7.375

 shape
shape
    Check    Circle  Diagonal Rectangle
    14.37    -44.96     -1.13     31.71

 acidtreatment:shape
             shape
acidtreatment  Check Circle Diagonal Rectangle
      Acid     0.625 -5.042    0.792     3.625
      Control -0.625  5.042   -0.792    -3.625

Estimates for αi and βj are (7.375, -7.375) and (14.375, -44.9583, -1.125, 31.7083), respectively.

(f)

> pinesNI.aov <- aov(resingrams ~ acidtreatment + shape, data = pines)
> TR
              Df Sum Sq Mean Sq F value   Pr(>F)
acidtreatment  1   1305    1305   25.87 6.56e-05 ***
shape          3  19407    6469  128.20 8.71e-13 ***
Residuals     19    959      50
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The completely additive model suggests both factors acidtreatment and shape are significant based on the ℘-values 6.56e-05 and 8.71e-13, respectively.

(g)

> CI
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = resingrams ~ acidtreatment + shape, data = pines)

$shape
                        diff        lwr        upr     p adj
Circle-Check       -59.33333 -70.865641 -47.801025 0.0000000
Diagonal-Check     -15.50000 -27.032308  -3.967692 0.0063690
Rectangle-Check     17.33333   5.801025  28.865641 0.0023646
Diagonal-Circle     43.83333  32.301025  55.365641 0.0000000
Rectangle-Circle    76.66667  65.134359  88.198975 0.0000000
Rectangle-Diagonal  32.83333  21.301025  44.365641 0.0000009

The mean resin collected for rectangular shaped holes is significantly greater than the mean resin collected for check, circular, and diagonal shapes.

15. The data stored in COWS were extracted from a Canadian record book of purebred dairy cattle. Random samples of 10 mature (five-year-old and older) and 10 two-year-old cows were taken from each of five breeds. The average butterfat percentage of these 100 cows is stored in the variable butterfat, with the type of cow stored in the variable breed and the age of the cow stored in the variable age.

(a) Create a two-way ANOVA table.
(b) Analyze the residuals and comment on whether the two-factorial model with interaction fits the data.
(c) If there are problems that might be remedied with a transformation, suggest an appropriate transformation and reanalyze the new model.
(d) Create a graphical display of the interactions for the model selected in (c). Is there significant interaction between breed and age?
(e) Based on the model selected in (c), compute group means and parameter estimates to fill in a table similar to Table 11.18.
(f) Using αe = 0.05, which breed has the highest average butterfat percentage?

Solution:

(a)

> summary(cows.aov)
            Df Sum Sq Mean Sq F value Pr(>F)
age          1   0.21   0.207   1.172  0.282
breed        4  34.32   8.580  48.595

> checking.plots(cows.aov)

[Figure: output of checking.plots(cows.aov) with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 100, Bandwidth = 0.305).]

There appears to be increasing variance with larger butterfat values. The model does not satisfy the homogeneity of variance assumption.

(c) The boxCox() function from the car package is used on the object cows.aov.

> library(car)
> boxCox(cows.aov, lambda = seq(-3, 0, length = 300))

The 95% confidence interval for λ extends from roughly -2.4 to -0.5. Since λ = −1 lies inside the 95% confidence interval and corresponds to a monotonic transformation, the decision to use λ = −1 is made.

> cowsTI.aov <- aov(I(butterfat^-1) ~ age * breed, data = COWS)
> TR
            Df  Sum Sq  Mean Sq F value Pr(>F)
age          1 0.00035 0.000355   0.977  0.326
breed        4 0.08797 0.021993  60.599
age:breed    4 0.00076
Residuals   90 0.03266

> checking.plots(cowsTI.aov)

[Figure: output of checking.plots(cowsTI.aov) with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 100, Bandwidth = 0.3601).]

After the butterfat values have been transformed, increasing variance no longer appears problematic. Although there are five observations whose standardized residuals are greater in absolute value than two, this is to be expected with one hundred observations.

(d)

> ggplot(data = COWS, aes(x = age, y = butterfat^-1, colour = breed,
+        group = breed, linetype = breed)) +
+   stat_summary(fun = mean, geom = "point") +
+   stat_summary(fun = mean, geom = "line") +
+   theme_bw() +
+   labs(x = "", y = expression(butterfat^{-1}))
> ggplot(data = COWS, aes(x = breed, y = butterfat^-1, colour = age,
+        group = age, linetype = age)) +
+   stat_summary(fun = mean, geom = "point") +
+   stat_summary(fun = mean, geom = "line") +
+   theme_bw() +
+   labs(x = "", y = expression(butterfat^{-1}))

[Figure: two interaction plots of mean butterfat^{-1}. One plots age (2 years old, Mature) on the x-axis with one line per breed (Ayrshire, Canadian, Guernsey, Holstein-Friesian, Jersey); the other plots breed on the x-axis with one line per age group.]


The interaction plots show relatively parallel lines, suggesting age and breed do not interact, which is corroborated by the interaction ℘-value = 0.7169 computed in (c).

(e)

> model.tables(cowsTI.aov, type = "means")
Tables of means
Grand mean

0.2285625

 age
age
2 years old      Mature
    0.22668     0.23045

 breed
breed
         Ayrshire          Canadian          Guernsey Holstein-Friesian            Jersey
          0.24730           0.22671           0.20388           0.27372           0.19121

 age:breed
             breed
age           Ayrshire Canadian Guernsey Holstein-Friesian  Jersey
  2 years old  0.24426  0.22282  0.19941           0.27691 0.18999
  Mature       0.25034  0.23059  0.20835           0.27054 0.19242
> model.tables(cowsTI.aov, type = "effects")
Tables of effects

 age
age
2 years old      Mature
 -0.0018833   0.0018833

 breed
breed
         Ayrshire          Canadian          Guernsey Holstein-Friesian            Jersey
          0.01874          -0.00186          -0.02468           0.04516          -0.03736

 age:breed
             breed
age            Ayrshire  Canadian  Guernsey Holstein-Friesian    Jersey
  2 years old -0.001156 -0.002000 -0.002586          0.005069  0.000673
  Mature       0.001156  0.002000  0.002586         -0.005069 -0.000673

Table 11.1: Group means and parameter estimates for the object cowsTI.aov

Age / Breed                                                    Ayrshire   Canadian   Guernsey   H-F        Jersey     $\bar{Y}_{i\bullet\bullet}$   $\hat{\alpha}_i = \bar{Y}_{i\bullet\bullet} - \bar{Y}_{\bullet\bullet\bullet}$
2 year old   $\bar{Y}_{1j\bullet}$                             0.2443     0.2228     0.1994     0.2769     0.1900     0.2267     −0.001883
             $\widehat{(\alpha\beta)}_{1j}$                    −0.0012    −0.0020    −0.0026    0.0051     0.0007
mature       $\bar{Y}_{2j\bullet}$                             0.2503     0.2306     0.2084     0.2705     0.1924     0.2304     0.001883
             $\widehat{(\alpha\beta)}_{2j}$                    0.0012     0.0020     0.0026     −0.0051    −0.0007
$\bar{Y}_{\bullet j\bullet}$                                   0.2473     0.2267     0.2039     0.2737     0.1912     $\bar{Y}_{\bullet\bullet\bullet} = 0.2286$
$\hat{\beta}_j = \bar{Y}_{\bullet j\bullet} - \bar{Y}_{\bullet\bullet\bullet}$   0.01874    −0.00186   −0.02468   0.04516    −0.03736

Using the results from the function model.tables(), Table 11.1 is created.

(f) To determine which breeds have higher butterfat production, Tukey’s HSD pairwise confidence intervals are created using the function TukeyHSD().

> CIt
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = I(butterfat^-1) ~ age * breed, data = COWS)

$breed
                                  diff          lwr          upr     p adj
Canadian-Ayrshire          -0.02059248 -0.037363459 -0.003821510 0.0082292
Guernsey-Ayrshire          -0.04341678 -0.060187757 -0.026645808 0.0000000
Holstein-Friesian-Ayrshire  0.02642525  0.009654271  0.043196220 0.0002965
Jersey-Ayrshire            -0.05609240 -0.072863376 -0.039321427 0.0000000
Guernsey-Canadian          -0.02282430 -0.039595272 -0.006053323 0.0024803
Holstein-Friesian-Canadian  0.04701773  0.030246756  0.063788705 0.0000000
Jersey-Canadian            -0.03549992 -0.052270892 -0.018728942 0.0000006
Holstein-Friesian-Guernsey  0.06984203  0.053071053  0.086613003 0.0000000
Jersey-Guernsey            -0.01267562 -0.029446594  0.004095355 0.2274154
Jersey-Holstein-Friesian   -0.08251765 -0.099288622 -0.065746673 0.0000000

At the αe = 0.05 level, all breeds are significantly different from one another with the exception of Jersey and Guernsey.

16. Photosynthesis in aquatic plants is often inhibited due to the salinity of the water. Some plants such as Cymodocea nodosa seagrass appear to thrive in waters with high salinity. To determine the stress of Cymodocea nodosa seagrass seedlings in four levels of salinity (05PSU, 11PSU, 18PSU, and 36PSU), with two levels of spermidine (NO, YES), plant stress was determined by taking the ratio of Fv /Fm of four vessels each with two Cymodocea nodosa seagrass seedlings that were randomly assigned to the eight treatments. Fv is the variable fluorescence, and Fm is the maximal fluorescence. The ratio Fv /Fm is stored under the variable name fluorescence in the SEAGRASS.csv file. The treatment structure is a 2 × 4 factorial experiment with 32 experimental units, where an experimental


unit is a vessel containing two Cymodocea nodosa seagrass seedlings. The data stored at https://raw.github.com/alanarnholt/Data/master/SEAGRASS.csv is part of a larger study by Elso et al. (2012). Plants not under stress typically have Fv /Fm values between 0.7 and 0.8. Salinity for this study was recorded in practical salinity units (PSU), where 36PSU corresponds to typical ocean salinity.
(a) Download the SEAGRASS.csv file using the source_data() function from the repmis package and store the results in an object named SEAGRASS.
(b) Find and report the mean and standard deviation of fluorescence for the 8 treatment combinations. Does it appear that plants are more stressed without spermidine and at lower levels of salinity?
(c) Are the assumptions for a factorial model satisfied with this data?
(d) Create interaction plots for the factors spermidine and salinity. Based on your graphs, is there interaction between spermidine and salinity?
(e) Write the hypotheses to test the main effects and the interaction for a 2 factor factorial design.
(f) Test the hypotheses from (e).
(g) Create and plot 99% family-wise confidence intervals for the pair-wise differences of the factors spermidine and salinity using the function TukeyHSD(). Interpret your confidence intervals.
(h) Compute the means and the effects for the variables spermidine and salinity in the factorial model using the function model.tables().
(i) Assume the true means for the eight treatments are:

MEANS
> with(data = SEAGRASS,
+      tapply(fluorescence, list(spermidine, salinity), mean))
        05PSU     11PSU     18PSU     36PSU
NO  0.5827917 0.7074583 0.7736958 0.7818125
YES 0.6700104 0.7239271 0.8026875 0.7990209
> with(data = SEAGRASS,
+      tapply(fluorescence, list(spermidine, salinity), sd))
         05PSU      11PSU      18PSU      36PSU
NO  0.05174214 0.06391753 0.01106553 0.00976514
YES 0.02473667 0.04936795 0.02479287 0.01492331

(c) Neither the qqPlot() nor the Shapiro-Wilk normality test suggests any deviation from normality with respect to the error terms of the factorial model stored in SEAGRASS.mod. Levene’s test of homogeneity of variance finds insufficient evidence to suggest the variances are not equal. The assumptions for a factorial model appear to be satisfied.

> leveneTest(SEAGRASS.mod)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  7  0.8973 0.5243
      24

[Figure: normal quantile-quantile plot of the studentized residuals of SEAGRASS.mod; x-axis: norm Quantiles, y-axis: Studentized Residuals(SEAGRASS.mod).]

(d)

> ggplot(data = SEAGRASS, aes(x = salinity, y = fluorescence,
+                             color = spermidine, group = spermidine,
+                             linetype = spermidine)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw()
> ggplot(data = SEAGRASS, aes(x = spermidine, y = fluorescence,
+                             color = salinity, group = salinity,
+                             linetype = salinity)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw()

[Figure: two interaction plots of mean fluorescence. First graph: fluorescence versus salinity (05PSU, 11PSU, 18PSU, 36PSU) with one line per spermidine level (NO, YES). Second graph: fluorescence versus spermidine (NO, YES) with one line per salinity level.]

The lines in the second graph cross, suggesting a small amount of interaction may be present.

(e) The hypotheses to test the main effects and the interaction for a 2 factor factorial design are:
• H0 : αi = 0 for all i versus H1 : αi ≠ 0 for some i.
• H0 : βi = 0 for all i versus H1 : βi ≠ 0 for some i.
• H0 : αβij = 0 for all (i, j) versus H1 : αβij ≠ 0 for some (i, j).

(f)
> SEAGRASS.mod <- aov(fluorescence ~ spermidine * salinity, data = SEAGRASS)
> SR <- summary(SEAGRASS.mod)
> SR
                    Df  Sum Sq Mean Sq F value   Pr(>F)
spermidine           1 0.01123 0.01123   8.270  0.00832 **
salinity             3 0.14379 0.04793  35.285 5.83e-09 ***
spermidine:salinity  3 0.00680 0.00227   1.668  0.20044
Residuals           24 0.03260 0.00136
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ℘-value of 0.0083, there is evidence to suggest some effect levels of spermidine are not zero. Based on the ℘-value of approximately 0 (5.83e-09), there is evidence to suggest some effect levels of salinity are not zero. Based on the ℘-value of 0.2004, there is not sufficient evidence to suggest any interaction effects are not zero.

(g) We can be 99% confident that adding spermidine reduces plant stress (that is, it increases the value of fluorescence). We can be 99% confident that all pair-wise differences between levels of salinity, except 36PSU versus 18PSU, are significant in the amount of stress they produce on Cymodocea nodosa seagrass seedlings.

> plot(TukeyHSD(SEAGRASS.mod, which = c("spermidine", "salinity"),
+               conf.level = 0.99))

[Figure: two Tukey HSD interval plots at the 99% family-wise confidence level. First panel: differences in mean levels of spermidine (YES−NO). Second panel: differences in mean levels of salinity for each pair of salinity levels.]

(h) The requested values are:

> model.tables(SEAGRASS.mod, type = "means", se = TRUE,
+              cterms = c("spermidine", "salinity"))
Tables of means
Grand mean

0.7301755

 spermidine
spermidine
    NO    YES
0.7114 0.7489

 salinity
salinity
 05PSU  11PSU  18PSU  36PSU
0.6264 0.7157 0.7882 0.7904

Standard errors for differences of means
        spermidine salinity
           0.01303  0.01843
replic.         16        8
> model.tables(SEAGRASS.mod, type = "effects", se = TRUE,
+              cterms = c("spermidine", "salinity"))
Tables of effects

 spermidine
spermidine
       NO       YES
-0.018736  0.018736

 salinity
salinity
   05PSU    11PSU    18PSU    36PSU
-0.10377 -0.01448  0.05802  0.06024

Standard errors of effects
        spermidine salinity
          0.009214 0.013031
replic.         16        8

(i) A function is written to find λ for each value of σ and then return the power for H0 : βi = 0 for all i versus H1 : βi ≠ 0 for some i, assuming n = 4.

alpha > >

library(car) TR1 F) group 3 2.5623 0.06266 . 63 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene’s test does not reject the null hypothesis of constant variance at the α = 0.05 level for either stage (℘-value = 0.314) or defoliation (℘-value = 0.0627). The assumptions for Model (B) appear to be satisfied.

Model (C)

> modelC.aov <- aov(yield ~ stage + defoli, data = SUNFLOWER[-TooBig, ])
> TR <- summary(modelC.aov)
> TR
            Df   Sum Sq Mean Sq F value   Pr(>F)
stage        4 13997052 3499263   5.343 0.000968 ***
defoli       3 21878479 7292826  11.136 6.85e-06 ***
Residuals   59 38638596  654891
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(i) The three basic assumptions concerning the errors: independence, normal distribution, and constant variance are assessed with the checking.plots() function.

> checking.plots(modelC.aov)

[Figure: four diagnostic panels produced by checking.plots(modelC.aov): standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and a density plot of standardized residuals (N = 67, bandwidth = 0.3511). Observations 39 and 70 are labeled.]

The model appears adequate.

(ii) The model’s effects are estimated with the function model.tables(). The values for the decomposition of the $Y_{ij}$ are obtained using the function proj() and stored in the object EFF.

> model.tables(modelC.aov, type = "effects")
Tables of effects

 stage
    stage1 stage2 stage3 stage4 stage5
     181.1  1.982  666.9  70.38 -815.1
rep   12.0 16.000   11.0  15.00   13.0

 defoli
    control treat1 treat2 treat3
        347  549.2 -13.84 -950.1
rep      16   18.0  17.00   16.0
> EFF <- proj(modelC.aov)
> head(EFF)
  (Intercept)      stage   defoli  Residuals
1    1494.955 181.128109 366.5917  -375.6750
2    1494.955 181.128109 366.5917  1783.3250
3    1494.955 181.128109 366.5917   329.3250
4    1494.955   1.982276 366.5917   197.4708
5    1494.955   1.982276 366.5917   791.4708
6    1494.955   1.982276 366.5917 -1675.5292

> VAL[1:10]
   1    2    3    4    5    6    7    9   10   11
1667 3826 2372 2061 2655  188 2309 3352 2987 1685
> SUNFLOWER[-TooBig, ]$yield[1:10]
 [1] 1667 3826 2372 2061 2655  188 2309 3352 2987 1685

Note that the values stored in the object VAL are the same as the original yield values used to construct Model (C).

(iii)

> CI <- TukeyHSD(modelC.aov, which = "defoli")
> CI
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ stage + defoli, data = SUNFLOWER[-TooBig, ])

$defoli
                     diff        lwr       upr     p adj
treat1-control   202.2444  -532.8707  937.3595 0.8857552
treat2-control  -360.8110 -1106.0312  384.4093 0.5791269
treat3-control -1297.0928 -2053.5200 -540.6656 0.0001663
treat2-treat1   -563.0554 -1286.6335  160.5228 0.1792500
treat3-treat1  -1499.3372 -2234.4523 -764.2221 0.0000075
treat3-treat2   -936.2818 -1681.5021 -191.0616 0.0081636

> CI <- TukeyHSD(modelC.aov, which = "stage")
> CI
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ stage + defoli, data = SUNFLOWER[-TooBig, ])

$stage
                     diff        lwr        upr     p adj
stage2-stage1  -179.14583 -1048.7577  690.46602 0.9775671
stage3-stage1   485.73485  -464.8130 1436.28274 0.6060833
stage4-stage1  -110.75000  -992.6974  771.19739 0.9965671
stage5-stage1  -996.23718 -1907.8381  -84.63626 0.0254664
stage3-stage2   664.88068  -227.0325 1556.79390 0.2348146
stage4-stage2    68.39583  -750.0167  886.80838 0.9993030
stage5-stage2  -817.09135 -1667.3761   33.19339 0.0653308
stage4-stage3  -596.48485 -1500.4293  307.45962 0.3517722
stage5-stage3 -1481.97203 -2414.8711 -549.07297 0.0003382
stage5-stage4  -885.48718 -1748.3838  -22.59057 0.0415604

> ggplot(data = ND, aes(x = stage, y = yield, fill = stage)) +
+   geom_boxplot() +
+   theme_bw() +
+   labs(y = "Sunflower Yield (kg/ha)", x = "") +
+   guides(fill = FALSE)
> library(plyr)
> ggplot(data = mdf, aes(x = stage, y = MeanStage, fill = stage)) +
+   geom_bar(stat = "identity") +
+   geom_errorbar(aes(ymin = MeanStage - SE, ymax = MeanStage + SE),
+                 width = 0.30) +
+   guides(fill = FALSE) +
+   theme_bw() +
+   labs(y = "Sunflower Yield (kg/ha)", x = "")

[Figure: boxplots of sunflower yield (kg/ha) by stage, and a bar chart of mean sunflower yield (kg/ha) by stage with error bars of one standard error; x-axis levels stage1 through stage5.]

(j)


> levels(ND$defoli)
[1] "CoT1" "T2T3"
> ggplot(data = ND, aes(x = defoli, y = yield, fill = defoli)) +
+   geom_boxplot() +
+   theme_bw() +
+   labs(y = "Sunflower Yield (kg/ha)", x = "") +
+   guides(fill = FALSE)
> library(plyr)
> ggplot(data = mdf, aes(x = defoli, y = MeanStage, fill = defoli)) +
+   geom_bar(stat = "identity") +
+   geom_errorbar(aes(ymin = MeanStage - SE, ymax = MeanStage + SE),
+                 width = 0.30) +
+   guides(fill = FALSE) +
+   theme_bw() +
+   labs(y = "Sunflower Yield (kg/ha)", x = "")

[Figure: boxplots of sunflower yield (kg/ha) by defoliation group, and a bar chart of mean sunflower yield (kg/ha) by defoliation group with error bars of one standard error; x-axis levels CoT1 and T2T3.]

Chapter 12 Regression

1. The manager of a URL commercial address is interested in predicting the number of megabytes downloaded, megasd, by clients according to the number of minutes they are connected, mconnected. The manager randomly selects (megabyte, minute) pairs, records the data, and stores the pairs (megasd, mconnected) in the file URLADDRESS.
(a) Create a scatterplot of the data. Characterize the relationship between megasd and mconnected.
(b) Fit a regression line to the data. Superimpose the resulting line in the plot created in part (a).
(c) Compute the covariance matrix of the $\hat{\beta}$s.
(d) What is the standard error of $\hat{\beta}_1$?
(e) What is the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$?
(f) Construct a 95% confidence interval for the slope of the regression line.
(g) Compute $R^2$, $R^2_a$, and the residual variance for the fitted regression.
(h) What assumptions need to be satisfied in order to use the model from part (b) for inferential purposes?
(i) Are there any outlying observations?
(j) Are there any influential observations? Compute and graph Cook’s distances, DFFITS, and DFBETAS to answer this question. Create a bubble plot of studentized residuals versus leverage values with plotted points proportional to Cook’s distance using the function influencePlot() from the car package. Does the bubble plot confirm your answer with respect to influential observations?
(k) Estimate the mean value of megabytes downloaded by clients spending 5, 10, and 15 minutes on line. Construct the corresponding 90% confidence intervals.
(l) Predict the megabytes downloaded by a client spending 30 minutes on line. Construct the corresponding 90% prediction interval.

Solution:

(a)
> ggplot(data = URLADDRESS, aes(x = mconnected, y = megasd)) +
+   geom_point() +
+   theme_bw()

[Figure: scatterplot of megasd (y-axis) versus mconnected (x-axis).]

Based on the graph, the relationship between megasd and mconnected is positive, linear, and strong.

(b)

> mod1a <- lm(megasd ~ mconnected, data = URLADDRESS)
> mod1a

Call:
lm(formula = megasd ~ mconnected, data = URLADDRESS)

Coefficients:
(Intercept)   mconnected
      6.189        9.831

> ggplot(data = URLADDRESS, aes(x = mconnected, y = megasd)) +
+   geom_point() +
+   theme_bw() +
+   geom_smooth(method = "lm")

[Figure: scatterplot of megasd versus mconnected with the fitted least squares line superimposed.]

The least squares regression line is $\hat{Y} = 6.189 + 9.8313x$.
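The fitted line reported by lm() can be reproduced from the closed-form least squares estimates $\hat{\beta}_1 = \sum (x_i - \bar{x})(Y_i - \bar{Y}) / \sum (x_i - \bar{x})^2$ and $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$. The following is a minimal sketch on simulated data; the vectors x and y are hypothetical stand-ins and are not the URLADDRESS values.

```r
# Simulated stand-in data (hypothetical; not the URLADDRESS values)
set.seed(1)
x <- runif(30, min = 2, max = 20)
y <- 6 + 10 * x + rnorm(30, sd = 9)

# Closed-form least squares estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Compare with the estimates returned by lm()
fit <- lm(y ~ x)
rbind(by_hand = c(b0, b1), lm = coef(fit))
```

The two rows of the printed matrix should agree to numerical precision.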

(c) The variance-covariance matrix of the $\hat{\beta}$s is computed with the function vcov().

> vcov(mod1a)
            (Intercept)  mconnected
(Intercept)  10.8661271 -0.85425034
mconnected   -0.8542503  0.08931002
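The matrix returned by vcov() is $\hat{\sigma}^2 (\mathbf{X}'\mathbf{X})^{-1}$: its diagonal holds the squared standard errors of the coefficients and its off-diagonal entry is the covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$. A minimal sketch on simulated data (the data are hypothetical, not URLADDRESS) that rebuilds the matrix by hand:

```r
# Simulated stand-in data (hypothetical; not the URLADDRESS values)
set.seed(2)
x <- runif(30, min = 2, max = 20)
y <- 6 + 10 * x + rnorm(30, sd = 9)
fit <- lm(y ~ x)

X <- model.matrix(fit)                              # n x 2 design matrix
sigma2_hat <- sum(resid(fit)^2) / df.residual(fit)  # residual variance
V <- sigma2_hat * solve(crossprod(X))               # sigma^2 * (X'X)^{-1}

max(abs(V - vcov(fit)))  # numerically zero
sqrt(diag(V))            # standard errors of the intercept and slope
V[1, 2]                  # covariance between intercept and slope
```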

(d)
> seb1hat
[1] 0.2988478
> # or
> TR
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 6.188972  3.2963809  1.877505 7.090274e-02
mconnected  9.831263  0.2988478 32.897221 6.369479e-24
> seb1hat
[1] 0.2988478

The standard error of $\hat{\beta}_1$ is $\hat{\sigma}_{\hat{\beta}_1} = s_{\hat{\beta}_1} = 0.2988$.


(e)
> vcov(mod1a)
            (Intercept)  mconnected
(Intercept)  10.8661271 -0.85425034
mconnected   -0.8542503  0.08931002
> covb0b1
[1] -0.8542503

The covariance between $\hat{\beta}_0$ and $\hat{\beta}_1$ is −0.8543.

(f)
> CI <- confint(mod1a)
> CI
                 2.5 %   97.5 %
(Intercept) -0.5633578 12.94130
mconnected   9.2191007 10.44342

The 95% confidence interval for the slope of the regression line from part (b) is $CI_{0.95}(\beta_1) = [9.2191, 10.4434]$.

(g)
> summary(mod1a)

Call:
lm(formula = megasd ~ mconnected, data = URLADDRESS)

Residuals:
     Min       1Q   Median       3Q      Max
-17.4601  -5.0653   0.0563   5.0002  26.3222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.1890     3.2964   1.878   0.0709 .
mconnected    9.8313     0.2988  32.897   <2e-16 ***

> summary(mod1a)$r.squared
[1] 0.9747799
> summary(mod1a)$adj.r.squared
[1] 0.9738792
> summary(mod1a)$sigma
[1] 8.992034
> summary(mod1a)$sigma^2
[1] 80.85668

The $R^2$, $R^2_a$, and residual variance are 0.9748, 0.9739, and 80.8567, respectively.

(h) The errors are assumed to be independent, follow a normal distribution with mean zero, and have constant variance. Since the errors are unobservable, the residuals are analyzed.

> par(mfrow = c(1, 2))
> plot(mod1a, which = 1:2)
> par(mfrow = c(1, 1))

[Figure: diagnostic plots for mod1a. First panel: Residuals vs Fitted. Second panel: Normal Q-Q plot of standardized residuals. Observations 8, 13, and 22 are labeled.]

The normality assumption for the errors appears reasonable based on the graphs of the residuals.

(i)
> TR <- outlierTest(mod1a)
> TR
   rstudent unadjusted p-value Bonferonni p
13 3.536669          0.0014863      0.04459

Using the Bonferroni approach to test for outliers with the function outlierTest() from the car package, observation 13 is considered an outlier (℘-value = 0.0446) at the α = 0.05 level.

(j)


> n n [1] 30 > > + > > > > + > > > >

p + > > > >

# plot(dfbetas(mod1a)[,1], type = "h", ylim = c(-1, 1), ylab = "", main = substitute(paste("DFBETAS for ", hat(beta)[0]))) CV CV)

named integer(0) > > + > > > >

# plot(dfbetas(mod1a)[,2], type = "h", ylim = c(-1, 1), ylab = "", main = substitute(paste("DFBETAS for ", hat(beta)[1]))) CV CV)

22 22

[Figure: four index plots for mod1a (x-axis: Index, 0 to 30): Cook's Distance, DFFITS, DFBETAS for $\hat{\beta}_0$, and DFBETAS for $\hat{\beta}_1$.]

Based on the graphs, DFFITS flags observations 13 and 22, while DFBETAS for $\hat{\beta}_1$ flags observation 22 for further scrutiny.
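The three influence measures can be computed directly with the base R functions cooks.distance(), dffits(), and dfbetas(). The sketch below uses simulated data with one artificially shifted response (a hypothetical stand-in for URLADDRESS), and the cutoffs shown are common rules of thumb assumed here for illustration; they are not necessarily the critical values used to produce the graphs above.

```r
# Simulated stand-in data with one shifted response (hypothetical)
set.seed(3)
n <- 30
x <- runif(n, min = 2, max = 20)
y <- 6 + 10 * x + rnorm(n, sd = 9)
y[13] <- y[13] + 40
fit <- lm(y ~ x)
p <- length(coef(fit))

cd  <- cooks.distance(fit)  # one value per observation
dff <- dffits(fit)          # one value per observation
dfb <- dfbetas(fit)         # n x p matrix, one column per coefficient

# Rule-of-thumb cutoffs (assumptions, not taken from the text)
which(cd > qf(0.5, p, n - p))       # Cook's distance vs. median of F(p, n - p)
which(abs(dff) > 2 * sqrt(p / n))   # DFFITS
which(abs(dfb[, 2]) > 2 / sqrt(n))  # DFBETAS for the slope
```

Index plots similar to those described above can then be drawn with, for example, plot(cd, type = "h").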

> influencePlot(mod1a)
      StudRes        Hat      CookD
13  3.5366686 0.03335346 0.39106806
16  0.2602743 0.13608497 0.07429144
22 -2.1953572 0.11099209 0.51453521

[Figure: bubble plot from influencePlot(mod1a): Studentized Residuals versus Hat-Values, with observations 13, 16, and 22 labeled.]

The bubble plot also flags observations 13 and 22 for further study. Observations 13 and 22 are potentially influential observations; however, without further knowledge of the data, there is little one can do other than omit the values and see if the regression line changes significantly. Here mod1b denotes the model refit without observations 13 and 22.

> coef(summary(mod1a))
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 6.188972  3.2963809  1.877505 7.090274e-02
mconnected  9.831263  0.2988478 32.897221 6.369479e-24
> coef(summary(mod1b))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  4.231873  2.5714045  1.645744 1.118536e-01
mconnected  10.008235  0.2390842 41.860720 2.176567e-25

Observations 13 and 22 are marginally influential, as the estimates and the standard errors for the intercept and slope are only marginally different when observations 13 and 22 are removed.

(k)
> CI <- predict(mod1a, newdata = data.frame(mconnected = c(5, 10, 15)),
+               interval = "confidence", level = 0.90)
> CI
        fit       lwr       upr
1  55.34529  51.71411  58.97646
2 104.50160 101.70009 107.30311
3 153.65791 149.72931 157.58652

The estimated mean value of megabytes downloaded by clients spending 5, 10, and 15 minutes on line is 55.3453, 104.5016, and 153.6579 megabytes, respectively. The individual


90% confidence intervals for clients spending 5, 10, and 15 minutes on line, respectively, are CI 0.90 [E(Yh )] = [51.7141, 58.9765], CI 0.90 [E(Yh )] = [101.7001, 107.3031], and CI 0.90 [E(Yh )] = [149.7293, 157.5865].
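Intervals of this kind come from predict() with interval = "confidence" for a mean response and interval = "prediction" for a single new response. A minimal sketch on simulated data (the data frame dat is a hypothetical stand-in for URLADDRESS) that also rebuilds one confidence interval by hand as $\hat{Y}_h \pm t_{0.95;\,n-2}\, s_{\hat{Y}_h}$:

```r
# Simulated stand-in data (hypothetical; not the URLADDRESS values)
set.seed(4)
dat <- data.frame(x = runif(30, min = 2, max = 20))
dat$y <- 6 + 10 * dat$x + rnorm(30, sd = 9)
fit <- lm(y ~ x, data = dat)

new <- data.frame(x = c(5, 10, 15))
conf_int <- predict(fit, newdata = new, interval = "confidence", level = 0.90)
pred_int <- predict(fit, newdata = new, interval = "prediction", level = 0.90)

# First confidence interval rebuilt by hand: fit +/- t * se(mean response)
x0 <- c(1, 5)
fit0 <- sum(x0 * coef(fit))
se0 <- sqrt(drop(t(x0) %*% vcov(fit) %*% x0))
by_hand <- fit0 + c(-1, 1) * qt(0.95, df.residual(fit)) * se0

conf_int
pred_int
by_hand
```

Prediction intervals are always wider than the matching confidence intervals because they add the residual variance to the variance of the estimated mean.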

(l)
> PI <- predict(mod1a, newdata = data.frame(mconnected = 30),
+               interval = "prediction", level = 0.90)
> PI
       fit      lwr      upr
1 301.1269 282.4263 319.8274

The predicted megabytes downloaded by a client spending 30 minutes on line is 301.1269. The 90% prediction interval for megabytes downloaded for a client spending 30 minutes on line is $PI_{0.90}[Y_{h(new)}] = [282.4263, 319.8274]$.

2. A metallurgic company is investigating lost revenue due to worker illness. It is interested in creating a table of lost revenue to be used for future budgets and company forecasting plans. The data are stored in the data frame LOSTR.
(a) Create a scatterplot of lost revenue versus number of ill workers. Characterize the relationship between lostrevenue and numbersick.
(b) Fit a regression line to the data. Superimpose the resulting line in the plot created in part (a).
(c) Compute the covariance matrix of the $\hat{\beta}$s.
(d) Create a 95% confidence interval for $\beta_1$.
(e) Compute the coefficient of determination and the adjusted coefficient of determination. Provide contextual interpretations of both values.
(f) What assumptions need to be satisfied in order to use the model from part (b) for inferential purposes? If there is/are any outlier/s in the data, remove it/them prior to answering the remainder of the questions.
(g) Determine the expected lost revenues when 5, 15, and 25 workers are absent due to illness.
(h) Compute a 95% prediction interval of lost revenues when 14 workers are absent due to illness.

Solution:

(a)

> ggplot(data = LOSTR, aes(x = numbersick, y = lostrevenue)) +
+   geom_point() +
+   theme_bw()

[Figure: scatterplot of lostrevenue (y-axis) versus numbersick (x-axis).]

Based on the graph, the relationship between lostrevenue and numbersick is positive, linear, and strong. There is one outlier. (b)

> mod2b <- lm(lostrevenue ~ numbersick, data = LOSTR)
> mod2b

Call:
lm(formula = lostrevenue ~ numbersick, data = LOSTR)

Coefficients:
(Intercept)   numbersick
      294.8         96.9

> ggplot(data = LOSTR, aes(x = numbersick, y = lostrevenue)) +
+   geom_point() +
+   theme_bw() +
+   geom_smooth(method = "lm")

[Figure: scatterplot of lostrevenue versus numbersick with the fitted least squares line superimposed.]

The least squares regression line is $\hat{Y} = 294.8392 + 96.897x$.

(c) The variance-covariance matrix of the $\hat{\beta}$s is computed with the function vcov().

> vcov(mod2b)
            (Intercept) numbersick
(Intercept)  12724.0361 -741.70261
numbersick    -741.7026   53.28323

(d)
> CI <- confint(mod2b)
> CI
               2.5 %   97.5 %
(Intercept) 61.49279 528.1855
numbersick  81.79671 111.9972

The 95% confidence interval for the slope of the regression line from part (b) is $CI_{0.95}(\beta_1) = [81.7967, 111.9972]$.

(e)
> summary(mod2b)$r.squared
[1] 0.8845437
> summary(mod2b)$adj.r.squared
[1] 0.8795239


The coefficient of determination and the adjusted coefficient of determination are 0.8845 and 0.8795, respectively. According to the linear model, 88.4544% of the variability in lost revenue is accounted for by variation in the number of sick employees.

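(f) To use the model from part (b) for inferential purposes, the errors should be independent, follow a normal distribution with mean zero, and have constant variance. Since the errors are unobservable, the residuals are examined.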

> par(mfrow = c(1, 2))
> plot(mod2b, which = 1:2)
> par(mfrow = c(1, 1))
> outlierTest(mod2b)
  rstudent unadjusted p-value Bonferonni p
4 42.11082         1.5825e-22   3.9562e-21

[Figure: diagnostic plots for mod2b. First panel: Residuals vs Fitted. Second panel: Normal Q-Q plot of standardized residuals.]

Observation 4 is an outlier since it has a studentized residual whose absolute value is greater than $3.505 = t_{1-0.05/(2n);\,25-2-1}$, where $n = 25$.
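The Bonferroni critical value quoted above can be checked with qt(). The sketch below takes n = 25 observations and p = 2 regression coefficients, as in the expression above; the variable names are illustrative only.

```r
n <- 25       # number of observations
p <- 2        # number of regression coefficients
alpha <- 0.05

# Bonferroni-adjusted two-sided critical value for a studentized residual
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)
round(crit, 3)

# Bonferroni p-value for the observed studentized residual of observation 4
2 * pt(-abs(42.11082), df = n - p - 1) * n
```

The first value should round to the 3.505 quoted above, and the second can be compared with the Bonferroni p reported by outlierTest().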

(g)
> CI
        fit       lwr       upr
1  703.8017  684.3172  723.2862
2 1703.7961 1691.9454 1715.6469
3 2703.7906 2681.5851 2725.9961

The expected lost revenues when 5, 15, and 25 workers are absent due to illness are $703.8017, $1703.7961, and $2703.7906, respectively.

(h)
> PI
       fit      lwr      upr
1 1603.797 1545.119 1662.474

The 95% prediction interval is $PI_{0.95}[Y_{h(new)}]$ = [$1545.119, $1662.4744] for loss in revenue with 14 workers absent.

3. To obtain a linear relationship between the employment (number of employed people = independent variable) and the GDP (gross domestic product = response variable), a researcher has taken data from 12 regions. Use the following information to answer the questions:

\[
\sum_{i=1}^{12} x_i = 581, \quad
\sum_{i=1}^{12} x_i^2 = 28507, \quad
\sum_{i=1}^{12} x_i Y_i = 2630, \quad
\sum_{i=1}^{12} Y_i = 53, \quad
\sum_{i=1}^{12} Y_i^2 = 267
\]

Source       df   SS      MS   Fobs   ℘-value
Regression   *    *       *    *      *
Error        *    22.08   *    *      *

(a) Complete the ANOVA table.
(b) Decide if the regression is statistically significant.
(c) Compute and interpret the coefficient of determination.
(d) Calculate the model’s residual variance.
(e) Write out the fitted regression line and construct a 90% confidence interval for the slope.

Solution:

(a)
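One way the table could be completed from the five summary sums alone is sketched below. It assumes, as the notation in the problem suggests, that x denotes the explanatory variable and Y the response; all object names are illustrative and are not taken from the original solution.

```r
# Summary statistics given in the problem (n = 12 regions)
n <- 12
sum_x <- 581; sum_x2 <- 28507; sum_xy <- 2630
sum_y <- 53;  sum_y2 <- 267

# Corrected sums of squares and cross products
Sxx <- sum_x2 - sum_x^2 / n
Sxy <- sum_xy - sum_x * sum_y / n
SST <- sum_y2 - sum_y^2 / n

# Least squares estimates
b1 <- Sxy / Sxx
b0 <- sum_y / n - b1 * sum_x / n

# ANOVA decomposition
SSR <- b1 * Sxy
SSE <- SST - SSR          # agrees with the 22.08 given in the table
MSR <- SSR / 1
MSE <- SSE / (n - 2)      # residual variance, part (d)
Fobs <- MSR / MSE
pvalue <- pf(Fobs, df1 = 1, df2 = n - 2, lower.tail = FALSE)

R2 <- SSR / SST           # coefficient of determination, part (c)
ci_b1 <- b1 + c(-1, 1) * qt(0.95, df = n - 2) * sqrt(MSE / Sxx)  # 90% CI, part (e)

data.frame(Source = c("Regression", "Error"),
           df = c(1, n - 2), SS = c(SSR, SSE), MS = c(MSR, MSE),
           Fobs = c(Fobs, NA), pvalue = c(pvalue, NA))
```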

DF > > + > > > > >


DF

SSR tobs pvalue c(tobs, pvalue) [1] 1.076022 0.150741 Do not reject H0 . H0 : β2 = −1 versus H1 : β2 < −1: > tobs pvalue c(tobs, pvalue) [1] -4.1751357607

0.0005444476

Reject H0.

5. Given a simple linear regression model, show

(a) $\hat{\sigma}^2 = \dfrac{\sum_i \hat{\varepsilon}_i^2}{n-2}$ is an unbiased estimator of $\sigma^2$.

(b) The diagonal element of the hat matrix can be expressed as
\[
h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2},
\]
where $\mathbf{x}_i' = (1, x_i)$.

Solution:

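(a) It must be shown that $E\left[\dfrac{\sum_i \hat{\varepsilon}_i^2}{n-2}\right] = \sigma^2$. In a simple linear regression model, $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. Summing over all $i$ and dividing by $n$ yields
\[
\bar{Y} = \beta_0 + \beta_1 \bar{x} + \bar{\varepsilon}.
\]
Since $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$ and $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$,
\[
\hat{\varepsilon}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i).
\]
The point $(\bar{x}, \bar{Y})$ is always on the simple linear regression line, so $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$. Substituting for $\hat{\beta}_0$ in the expression for $\hat{\varepsilon}_i$ gives
\[
\hat{\varepsilon}_i = (Y_i - \bar{Y}) + \hat{\beta}_1(\bar{x} - x_i).
\]
Since $Y_i - \bar{Y} = \beta_1(x_i - \bar{x}) + (\varepsilon_i - \bar{\varepsilon})$,
\[
\hat{\varepsilon}_i = (\beta_1 - \hat{\beta}_1)(x_i - \bar{x}) + (\varepsilon_i - \bar{\varepsilon}). \tag{12.1}
\]
Squaring and summing both sides of (12.1) yields
\begin{align*}
\sum_{i=1}^{n} \hat{\varepsilon}_i^2
&= \sum_{i=1}^{n} \left[(\beta_1 - \hat{\beta}_1)^2 (x_i - \bar{x})^2 + 2(\beta_1 - \hat{\beta}_1)(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon}) + (\varepsilon_i - \bar{\varepsilon})^2\right] \\
&= \underbrace{(\beta_1 - \hat{\beta}_1)^2 \sum_{i=1}^{n}(x_i - \bar{x})^2}_{\text{First}}
 + \underbrace{2(\beta_1 - \hat{\beta}_1)\sum_{i=1}^{n}(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon})}_{\text{Middle}}
 + \underbrace{\sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})^2}_{\text{Last}}
\end{align*}
The expectations of the First, Middle, and Last expressions will be taken to ascertain $E\left[\sum_{i=1}^{n} \hat{\varepsilon}_i^2\right]$. Recall that
\[
\operatorname{Var}\left[\hat{\beta}_1\right] = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}. \tag{12.2}
\]
The expected value of First is
\begin{align*}
E\left[(\beta_1 - \hat{\beta}_1)^2 \sum_{i=1}^{n}(x_i - \bar{x})^2\right]
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot E\left[(\beta_1 - \hat{\beta}_1)^2\right] \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \operatorname{Var}\left[\hat{\beta}_1\right] \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\
&= \sigma^2
\end{align*}
The expected value of Middle is
\begin{align*}
E\left[2(\beta_1 - \hat{\beta}_1)\sum_{i=1}^{n}(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon})\right]
&= -2E\left[(\hat{\beta}_1 - \beta_1)\sum_{i=1}^{n}(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon})\right] \\
&= -2E\left[(\hat{\beta}_1 - \beta_1)\sum_{i=1}^{n}(x_i - \bar{x})\left(\hat{\varepsilon}_i - (\beta_1 - \hat{\beta}_1)(x_i - \bar{x})\right)\right] && \text{from (12.1)} \\
&= -2E\left[(\hat{\beta}_1 - \beta_1)\left(\sum_{i=1}^{n}\hat{\varepsilon}_i(x_i - \bar{x}) + (\hat{\beta}_1 - \beta_1)\sum_{i=1}^{n}(x_i - \bar{x})^2\right)\right]
\end{align*}
By property 3 of the fitted regression line, $\sum_{i=1}^{n} x_i \hat{\varepsilon}_i = 0$, and by property 1 of the fitted regression line, $\sum_{i=1}^{n} \hat{\varepsilon}_i = 0$, so the first sum inside the parentheses vanishes:
\begin{align*}
&= -2E\left[(\hat{\beta}_1 - \beta_1)^2 \sum_{i=1}^{n}(x_i - \bar{x})^2\right] \\
&= -2\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \operatorname{Var}\left[\hat{\beta}_1\right] \\
&= -2\sigma^2 && \text{by (12.2)}
\end{align*}
Recall that $\varepsilon_i \sim N(0, \sigma^2)$. This means $E[\varepsilon_i^2] = E[(\varepsilon_i - 0)^2] = \operatorname{Var}[\varepsilon_i] = \sigma^2$. For simple linear regression, it is also true that the covariance of any two error terms is zero. This implies that $E[\varepsilon_i \varepsilon_j] = 0$ if $i \neq j$. The expected value of Last is
\begin{align*}
E\left[\sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})^2\right]
&= E\left[\sum_{i=1}^{n}\left(\varepsilon_i^2 - 2\varepsilon_i\bar{\varepsilon} + \bar{\varepsilon}^2\right)\right] \\
&= \sum_{i=1}^{n} E\left[\varepsilon_i^2\right] - 2\sum_{i=1}^{n} E\left[\varepsilon_i \bar{\varepsilon}\right] + nE\left[\bar{\varepsilon}^2\right] \\
&= n\sigma^2 - 2\sum_{i=1}^{n} E\left[\varepsilon_i \cdot \frac{\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_n}{n}\right] + n\operatorname{Var}[\bar{\varepsilon}] \\
&= n\sigma^2 - 2\sigma^2 + n \cdot \frac{\sigma^2}{n} \\
&= n\sigma^2 - \sigma^2 = \sigma^2(n - 1)
\end{align*}
Combining the three pieces,
\begin{align*}
E\left[\sum_i \hat{\varepsilon}_i^2\right] &= E[\text{First} + \text{Middle} + \text{Last}] \\
&= \sigma^2 - 2\sigma^2 + \sigma^2(n - 1) = (n - 2)\sigma^2 \\
\implies E\left[\hat{\sigma}^2\right] &= E\left[\frac{\sum_i \hat{\varepsilon}_i^2}{n-2}\right] = \sigma^2 \quad \text{Check}
\end{align*}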
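The result can also be checked empirically. The following is a small Monte Carlo sketch under assumed values (n = 10, σ = 2, and a hypothetical true line 1 + 0.5x); it averages the residual sum of squares over many simulated samples and compares the divisors n − 2 and n.

```r
# Monte Carlo check under assumed settings (hypothetical values)
set.seed(5)
n <- 10
sigma <- 2
x <- 1:n
Sxx <- sum((x - mean(x))^2)

sse <- replicate(5000, {
  y <- 1 + 0.5 * x + rnorm(n, sd = sigma)
  Sxy <- sum((x - mean(x)) * (y - mean(y)))
  Syy <- sum((y - mean(y))^2)
  Syy - Sxy^2 / Sxx        # residual sum of squares of the fitted line
})

mean(sse) / (n - 2)  # should be near sigma^2 = 4
mean(sse) / n        # dividing by n underestimates sigma^2 on average
```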


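(b) For the simple linear regression model,
\[
(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\begin{bmatrix}
\sum_{j=1}^{n} x_j^2 & -\sum_{j=1}^{n} x_j \\
-\sum_{j=1}^{n} x_j & n
\end{bmatrix}
\]
so that
\begin{align*}
h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\begin{bmatrix} 1 & x_i \end{bmatrix}
\begin{bmatrix}
\sum_{j=1}^{n} x_j^2 & -\sum_{j=1}^{n} x_j \\
-\sum_{j=1}^{n} x_j & n
\end{bmatrix}
\begin{bmatrix} 1 \\ x_i \end{bmatrix} \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\begin{bmatrix} 1 & x_i \end{bmatrix}
\begin{bmatrix}
\sum_{j=1}^{n} x_j^2 - x_i\sum_{j=1}^{n} x_j \\
-\sum_{j=1}^{n} x_j + n x_i
\end{bmatrix} \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\sum_{j=1}^{n} x_j^2 - x_i\sum_{j=1}^{n} x_j + x_i\left(-\sum_{j=1}^{n} x_j + n x_i\right)\right] \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\sum_{j=1}^{n} x_j^2 - 2x_i\sum_{j=1}^{n} x_j + n x_i^2\right]
\end{align*}
Recall that
\[
\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{x}^2 = \frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n}
\quad \text{and} \quad
\sum_{j=1}^{n} x_j = n\bar{x}.
\]
Dividing the bracket by $n$ and adding and subtracting $\bar{x}^2$ (add for form) gives
\begin{align*}
h_{ii}
&= \frac{1}{\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{x}^2 + \bar{x}^2 - 2x_i\bar{x} + x_i^2\right] \\
&= \frac{1}{\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n} + (x_i - \bar{x})^2\right] \\
&= \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \quad \text{Check}
\end{align*}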
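A quick numerical check of this identity, using a small made-up data set (the values of x and y are hypothetical), compares the closed-form leverages with those returned by hatvalues():

```r
# Small made-up data set (hypothetical values)
x <- c(2, 4, 5, 7, 11, 12)
y <- c(1.1, 2.3, 2.2, 3.9, 5.8, 6.1)
fit <- lm(y ~ x)

# Leverages from the closed-form expression
h_formula <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# Leverages from the hat matrix, as computed by R
h_lm <- hatvalues(fit)

max(abs(h_formula - h_lm))  # numerically zero
```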

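6. Show that (12.63) and (12.64) are algebraically equivalent:
\[
D_i = \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})'(\mathbf{X}'\mathbf{X})(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}{p\hat{\sigma}^2}
= \frac{r_i^2}{p}\cdot\frac{h_{ii}}{1 - h_{ii}}.
\]
Note: $r_i = \dfrac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$ and $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$.

HINT:
\[
\left(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}\right)^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}. \tag{12.3}
\]

Solution: Since $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, it follows that $\hat{\boldsymbol{\beta}}_{(i)}$, where “(i)” means “without case i,” is
\[
\hat{\boldsymbol{\beta}}_{(i)} = \left(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}\right)^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)}. \tag{12.4}
\]
Rewriting (12.4) using the HINT in (12.3) gives
\[
\hat{\boldsymbol{\beta}}_{(i)} = \left[(\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}\right]\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} \tag{12.5}
\]
where $\mathbf{X} = \begin{bmatrix}\mathbf{X}_{(i)} \\ \mathbf{x}_i'\end{bmatrix}$ and $\mathbf{Y} = \begin{bmatrix}\mathbf{Y}_{(i)} \\ Y_i\end{bmatrix}$. Using the expressions for $\mathbf{X}$ and $\mathbf{Y}$, $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ can be rewritten as
\[
\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i Y_i. \tag{12.6}
\]
Subtracting (12.6) from (12.5) gives
\begin{align*}
\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}}
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)}}{1 - h_{ii}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i Y_i \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\left[\frac{\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)}}{1 - h_{ii}} - Y_i\right] \\
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} - (1 - h_{ii})Y_i\right]
\end{align*}
Substituting $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$ inside the brackets gives
\begin{align*}
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} - \left(1 - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\right)Y_i\right] \\
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} + \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i Y_i - Y_i\right]
\end{align*}
Going back to the expression for $\hat{\boldsymbol{\beta}}$ from (12.6) gives
\[
= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'\hat{\boldsymbol{\beta}} - Y_i\right]
= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[-\hat{\varepsilon}_i\right]
\]
Keep this expression handy for problem 7:
\[
\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}} = \frac{-(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\hat{\varepsilon}_i}{1 - h_{ii}} \tag{12.7}
\]
$D_i$ can now be written
\begin{align*}
D_i &= \frac{(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})'(\mathbf{X}'\mathbf{X})(\hat{\boldsymbol{\beta}}_{(i)} - \hat{\boldsymbol{\beta}})}{p\hat{\sigma}^2} \\
&= \frac{\left[\dfrac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i[-\hat{\varepsilon}_i]}{1 - h_{ii}}\right]'(\mathbf{X}'\mathbf{X})\left[\dfrac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i[-\hat{\varepsilon}_i]}{1 - h_{ii}}\right]}{p\hat{\sigma}^2} \\
&= \frac{\hat{\varepsilon}_i\,\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\,\hat{\varepsilon}_i}{p\hat{\sigma}^2(1 - h_{ii})^2} \\
&= \frac{\hat{\varepsilon}_i\,h_{ii}\,\hat{\varepsilon}_i}{p\hat{\sigma}^2(1 - h_{ii})^2}
\end{align*}
Since $r_i = \dfrac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$, $r_i^2 = \dfrac{\hat{\varepsilon}_i^2}{\hat{\sigma}^2(1 - h_{ii})}$, and $h_{ii}$ is a constant,
\[
D_i = \frac{r_i^2\,h_{ii}}{p(1 - h_{ii})} \quad \text{Check}
\]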
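The equivalence can be verified numerically. This is a minimal sketch on simulated data (the predictors x1 and x2 and the response y are hypothetical): the definition-based form refits the model without case i, and the shortcut form uses only the standardized residuals and the leverages.

```r
# Simulated data with two predictors (hypothetical)
set.seed(6)
x1 <- rnorm(20)
x2 <- rnorm(20)
y <- 1 + 2 * x1 - x2 + rnorm(20)
fit <- lm(y ~ x1 + x2)

X <- model.matrix(fit)
p <- length(coef(fit))
s2 <- summary(fit)$sigma^2
h <- hatvalues(fit)
r <- rstandard(fit)   # internally studentized residuals r_i

# Shortcut form: r_i^2 * h_ii / (p * (1 - h_ii))
D_formula <- r^2 * h / (p * (1 - h))

# Definition-based form: refit without case i and measure the shift in beta-hat
D_def <- sapply(seq_along(y), function(i) {
  d <- coef(lm(y[-i] ~ x1[-i] + x2[-i])) - coef(fit)
  drop(t(d) %*% crossprod(X) %*% d) / (p * s2)
})

max(abs(D_def - D_formula))                  # numerically zero
max(abs(D_formula - cooks.distance(fit)))    # numerically zero
```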

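7. Show that (12.65) and (12.66) are algebraically equivalent:
\[
\mathrm{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} = r_i^*\sqrt{\frac{h_{ii}}{1 - h_{ii}}}.
\]

Solution: Recall that
\[
r_i^* = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}}.
\]
Also note that
\[
\hat{Y}_i - \hat{Y}_{i(i)} = \mathbf{x}_i'\hat{\boldsymbol{\beta}} - \mathbf{x}_i'\hat{\boldsymbol{\beta}}_{(i)}.
\]
The expression in (12.7) as well as that for $h_{ii}$ allow simplification to
\begin{align*}
\hat{Y}_i - \hat{Y}_{i(i)} &= \mathbf{x}_i'\,\frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\hat{\varepsilon}_i}{1 - h_{ii}} \\
&= \frac{h_{ii}\hat{\varepsilon}_i}{1 - h_{ii}}
\end{align*}
The original expression becomes
\begin{align*}
\mathrm{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}}
&= \frac{h_{ii}\hat{\varepsilon}_i}{1 - h_{ii}}\cdot\frac{1}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} \\
&= \frac{\hat{\varepsilon}_i\cdot\sqrt{h_{ii}}}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}\cdot\sqrt{1 - h_{ii}}} \\
&= r_i^*\sqrt{\frac{h_{ii}}{1 - h_{ii}}} \quad \text{Check}
\end{align*}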
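As a numerical check of this identity, the sketch below (simulated, hypothetical data) compares the value returned by dffits() with the externally studentized residuals from rstudent() multiplied by $\sqrt{h_{ii}/(1 - h_{ii})}$:

```r
# Simulated data (hypothetical)
set.seed(7)
x <- runif(25, min = 0, max = 10)
y <- 3 + 1.5 * x + rnorm(25, sd = 2)
fit <- lm(y ~ x)

h <- hatvalues(fit)
r_star <- rstudent(fit)   # externally studentized residuals r_i^*

dffits_formula <- r_star * sqrt(h / (1 - h))

max(abs(dffits_formula - dffits(fit)))  # numerically zero
```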

8. Show that the SSE in a linear model expressed in summation notation is equivalent to the SSE expressed in matrix notation:

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = Y'Y - \hat{\beta}'X'Y.$$

Solution: Note that $\hat{\beta}' = Y'X(X'X)^{-1}$.

$$
\begin{aligned}
SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 &= \hat{\varepsilon}'\hat{\varepsilon} \\
&= (Y - X\hat{\beta})'(Y - X\hat{\beta}) \\
&= Y'Y - \hat{\beta}'X'Y - Y'X\hat{\beta} + \hat{\beta}'X'X\hat{\beta} \\
&= Y'Y - \hat{\beta}'X'Y - Y'X\hat{\beta} + Y'X(X'X)^{-1}X'X\hat{\beta} \\
&= Y'Y - \hat{\beta}'X'Y \quad \text{Check}
\end{aligned}
$$
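As a quick numerical illustration of this result, the sketch below compares the two forms of SSE. The data set (`mtcars`) and the model are assumptions chosen only because they ship with R; they are not part of the exercise.

```r
# Illustrative check of SSE = sum((Y - Yhat)^2) = Y'Y - b'X'Y (mtcars assumed).
fit <- lm(mpg ~ wt + hp, data = mtcars)
Y   <- mtcars$mpg
X   <- model.matrix(fit)   # n x p design matrix, including the intercept column
b   <- coef(fit)           # beta-hat

SSE_sum <- sum((Y - fitted(fit))^2)                     # summation notation
SSE_mat <- drop(t(Y) %*% Y - t(b) %*% t(X) %*% Y)       # matrix notation

all.equal(SSE_sum, SSE_mat)
all.equal(SSE_sum, deviance(fit))   # deviance() of an lm is the residual sum of squares
```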

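9. Show that the SSR in a linear model expressed in summation notation is equivalent to the SSR expressed in matrix notation:

$$SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = \hat{\beta}'X'Y - \frac{1}{n} Y'JY.$$

Solution: It is known that $SSR = SST - SSE$, so

$$
\begin{aligned}
SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 &= \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 \\
&= \left[\sum_{i=1}^{n} Y_i^2 - \frac{\left(\sum_{i=1}^{n} Y_i\right)^2}{n}\right] - \left[Y'Y - \hat{\beta}'X'Y\right] \\
&= \left[Y'Y - \frac{1}{n} Y'JY\right] - \left[Y'Y - \hat{\beta}'X'Y\right] \\
&= \hat{\beta}'X'Y - \frac{1}{n} Y'JY \quad \text{Check}
\end{aligned}
$$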
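The same identity can be verified numerically, including the quadratic form $Y'(H - \frac{1}{n}J)Y$. This is a hedged sketch that assumes the built-in `mtcars` data and an arbitrary two-predictor model; it is an illustration, not the solution to any exercise.

```r
# Illustrative check of SSR = b'X'Y - Y'JY/n = Y'(H - J/n)Y (mtcars assumed).
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nrow(mtcars)
Y   <- matrix(mtcars$mpg, ncol = 1)
X   <- model.matrix(fit)
b   <- coef(fit)
H   <- X %*% solve(crossprod(X)) %*% t(X)   # hat (projection) matrix
J   <- matrix(1, nrow = n, ncol = n)        # n x n matrix of ones

SSR_sum   <- sum((fitted(fit) - mean(mtcars$mpg))^2)        # summation notation
SSR_mat   <- drop(t(b) %*% t(X) %*% Y - t(Y) %*% J %*% Y / n)
SSR_quad  <- drop(t(Y) %*% (H - J / n) %*% Y)               # quadratic form
SSR_anova <- sum(anova(fit)[c("wt", "hp"), "Sum Sq"])       # sequential sums of squares

all.equal(SSR_sum, SSR_mat)
all.equal(SSR_sum, SSR_quad)
all.equal(SSR_sum, SSR_anova)
```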

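10. Show that the trace of the hat matrix H is equal to p, the number of parameters (βs), in a multiple linear regression model.

Solution: Note: Suppose a matrix A exists and is n × n. Then

$$\operatorname{tr}(A) = \sum_{i=1}^{n} a_{ii}.$$

For any matrices A, B, and C,

$$\operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB) \tag{12.8}$$

when such products exist. The projection matrix is $H = X(X'X)^{-1}X'$, so

$$\operatorname{tr}(H) = \operatorname{tr}\left(X(X'X)^{-1}X'\right) \underbrace{=}_{\text{by (12.8)}} \operatorname{tr}\left(X'X(X'X)^{-1}\right) = \operatorname{tr}(I_{p \times p}) = p$$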
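A short numerical illustration of the trace result follows. It is a sketch under assumptions (the built-in `mtcars` data and an arbitrary model with p = 3 parameters), not part of the original solution.

```r
# Illustrative check that tr(H) = p (mtcars and the model are assumptions).
fit <- lm(mpg ~ wt + hp, data = mtcars)
X   <- model.matrix(fit)
p   <- ncol(X)                                  # number of parameters, here 3

H      <- X %*% solve(crossprod(X)) %*% t(X)    # n x n hat matrix
tr_H   <- sum(diag(H))                          # trace of H
tr_cyc <- sum(diag(crossprod(X) %*% solve(crossprod(X))))  # tr(X'X (X'X)^{-1})

all.equal(tr_H, p)
all.equal(tr_cyc, p)
all.equal(sum(hatvalues(fit)), p)               # the leverages sum to p
```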


11. The data frame HSWRESTLER contains information on nine variables for a group of 78 high school wrestlers that was collected by the human performance lab at Appalachian State University. The variables are age (in years), ht (height in inches), wt (weight in pounds), abs (abdominal skinfold measure), triceps (tricep skinfold measure), subscap (subscapular skinfold measure), hwfat (hydrostatic determination of fat), tanfat (Tanita determination of fat), and skfat (skinfold determination of fat). Use hwfat (Y), abs (x1), and triceps (x2) to verify empirically the value obtained for SSR(x2, x1) using quadratic forms.

Solution:

R Code 12.1
> mod1 <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
> anova(mod1)
Analysis of Variance Table

Response: hwfat
          Df Sum Sq Mean Sq F value    Pr(>F)
abs        1 5072.8  5072.8 541.365 < 2.2e-16 ***
triceps    1  242.2   242.2  25.844 2.639e-06 ***
Residuals 75  702.8     9.4
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> SSR <- sum(anova(mod1)[1:2, "Sum Sq"])
> SSR
[1] 5315.008

The value of SSR obtained with the lm() and anova() functions is 5315.0081. Recall that

$$SSR = \hat{\beta}'X'Y - \frac{1}{n} Y'JY = Y'\left(H - \frac{1}{n}J\right)Y$$

> n <- nrow(HSWRESTLER)
> Y <- matrix(HSWRESTLER$hwfat, nrow = n)
> dim(Y)
[1] 78  1
> X <- model.matrix(mod1)
> dim(X)
[1] 78  3
> H <- X %*% solve(t(X) %*% X) %*% t(X)
> dim(H)
[1] 78 78
> J <- matrix(1, nrow = n, ncol = n)
> dim(J)
[1] 78 78
> SSR <- t(Y) %*% (H - J/n) %*% Y
> SSR
         [,1]
[1,] 5315.008

Using quadratic forms, the value for SSR is 5315.0081, the same value computed with R Code 12.1.

12. The data frame KINDER contains the height in inches and weight in pounds of 20 children from a kindergarten class. Use all 20 observations and construct a regression model where the results are stored in the object mod by regressing weight on height.

(a) Create a scatterplot of weight versus height to verify a possible linear relationship between the two variables.

(b) Compute and display the hat values for mod in a graph. Use the graph to identify the two largest hat values. Superimpose a horizontal line at 2p/n. Remove the values that exceed 2p/n and regress weight on height, storing the results in an object named modk.

(c) Remove case 19 from the original data frame KINDER and regress weight on height, storing the results in modk19. Is the child with the largest hat value an influential observation if one considers the 19 observations without case 19 from the original data frame? Compute and consider Cook's $D_i$, $\mathrm{DFFITS}_i$, and $\mathrm{DFBETAS}_{k(i)}$ in reaching a conclusion. Specifically, produce a graph showing $h_{ii}$, the differences $\hat{\beta}_{1(i)} - \hat{\beta}_1$, $\mathrm{DFBETAS}_{k(i)}$, studentized residuals, $\mathrm{DFFITS}_i$, Cook's $D_i$, and a bubble-plot of studentized residuals versus leverage values with plotted points proportional to Cook's distance, along with the corresponding values that flag observations for further scrutiny assuming α = 0.10. Hint: Use the functions fortify() from the ggplot2 package and lm.influence().

(d) Remove case 20 from the data frame KINDER and regress weight on height, storing the results in modk20. Is the child with the largest hat value an influential observation if one considers the 19 observations without case 20 from the original data frame? Compute and consider Cook's $D_i$, $\mathrm{DFFITS}_i$, and $\mathrm{DFBETAS}_{k(i)}$ in reaching a conclusion. Specifically, produce a graph showing $h_{ii}$, the differences $\hat{\beta}_{1(i)} - \hat{\beta}_1$, $\mathrm{DFBETAS}_{k(i)}$, studentized residuals, $\mathrm{DFFITS}_i$, Cook's $D_i$, and a bubble-plot of studentized residuals versus leverage values with plotted points proportional to Cook's distance, along with the corresponding values that flag observations for further scrutiny assuming α = 0.10.

(e) Create a scatterplot showing all 20 children. Use a solid circle to identify case 19 and a solid triangle to identify case 20. Superimpose the lines for models mod (lty = 1), modk (lty = 2), modk19 (lty = 3), and modk20 (lty = 4).


Solution:

(a)
> ggplot(data = KINDER, aes(x = wt, y = ht)) +
+   geom_point() + theme_bw() +
+   labs(x = "Weight (pounds)", y = "Height (inches)")

[Figure: scatterplot of Height (inches) versus Weight (pounds) for the 20 children in KINDER]

Based on the scatterplot, a linear relationship between ht and wt appears reasonable; however, two points bear further scrutiny.

(b)

mod > > > > > + + + > + + + + > > + + +


modk19 +


modk20 line2
> coef(summary(line2))
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -5373.946 35455.1129 -0.1515704 8.814204e-01
area         3183.757   433.2997  7.3477006 1.641688e-06
> line3 coef(summary(line3))
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 95394.348 20555.5010 4.640818 4.254153e-05
area         1766.622   231.2392 7.639803 4.051498e-09

(c)
> ggplot(data = VIT2005, aes(x = area, y = totalprice,
+                            color = conservation1)) +
+   geom_point() +
+   geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
+   theme_bw()
> rm(VIT2005) # Clean up

[Figure: scatterplot of totalprice versus area, colored by conservation1 (levels A, B, and C), with a fitted least-squares line for each level]

Case Study: Biomass

Data and ideas for this case study come from (Goicoa et al., 2011).

14. To estimate the amount of carbon dioxide retained in a tree, its biomass needs to be known and multiplied by an expansion factor (there are several alternatives in the literature). To calculate the biomass, specific regression equations by species are frequently used. These regression equations, called allometric equations, estimate the biomass of the tree by means of some known characteristics, typically diameter and/or height of the stem and branches. The BIOMASS file contains data of 42 beeches (Fagus sylvatica) from a forest of Navarra (Spain) in 2006, where

• diameter: diameter of the stem in centimeters
• height: height of the tree in meters
• stemweight: weight of the stem in kilograms
• aboveweight: aboveground weight in kilograms

(a) Create a scatterplot of aboveweight versus diameter. Superimpose a regression line over the plot just created. Is the relationship linear?

(b) Create a scatterplot of log(aboveweight) versus log(diameter). Is the relationship linear? Superimpose a regression line over the plot just created.

(c) Fit the regression model log(aboveweight) = β0 + β1 log(diameter), and compute $R^2$, $R_a^2$, and the variance of the residuals.

(d) Introduce log(height) as an explanatory variable and fit the model log(aboveweight) = β0 + β1 log(diameter) + β2 log(height). What is the effect of introducing log(height) in the model?

(e) Complete the Analysis questions for the model in (d).

Analysis questions:

(1) Estimate the model's parameters and their standard errors. Provide an interpretation for the model's parameters.
(2) Compute the variance-covariance matrix of the $\hat{\beta}$s.
(3) Provide 95% confidence intervals for β1 and β2.
(4) Compute the $R^2$, $R_a^2$, and the residual variance.
(5) Construct a graph with the default diagnostics plots of R.
(6) Can homogeneity of variance be assumed?
(7) Do the residuals appear to follow a normal distribution?
(8) Are there any outliers in the data?
(9) Are there any influential observations in the data?

(f) Obtain predictions of the aboveground biomass of trees with diameters diameter = seq(12.5, 42.5, 5) and heights height = seq(10, 40, 5). Note that the weight predictions are obtained from back transforming the logarithm. The bias correction is obtained by means of the lognormal distribution: If $\hat{Y}_{pred}$ is the prediction on the log scale, the corrected (back-transformed) prediction $\tilde{Y}_{pred}$ is given by

$$\tilde{Y}_{pred} = \exp\left(\hat{Y}_{pred} + \hat{\sigma}^2/2\right)$$

where $\hat{\sigma}^2$ is the variance of the error term.
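The bias correction in part (f) can be sketched in a few lines. The example below is an illustration under stated assumptions, not the solution: the data are simulated (the coefficients -1.5 and 2.3 and the error standard deviation 0.2 are arbitrary), and the names `sim`, `naive`, and `corrected` are hypothetical.

```r
# Illustrative bias-corrected back-transformation for a log-log model.
# All data are simulated; nothing here uses the BIOMASS data frame.
set.seed(1)
n        <- 200
diameter <- runif(n, min = 10, max = 60)
logw     <- -1.5 + 2.3 * log(diameter) + rnorm(n, sd = 0.2)
sim      <- data.frame(diameter = diameter, weight = exp(logw))

fit    <- lm(log(weight) ~ log(diameter), data = sim)
sigma2 <- summary(fit)$sigma^2                # estimated error variance

new      <- data.frame(diameter = seq(12.5, 42.5, 5))
pred_log <- predict(fit, newdata = new)       # predictions on the log scale

naive     <- exp(pred_log)                    # plain back-transformation
corrected <- exp(pred_log + sigma2 / 2)       # lognormal bias correction

cbind(new, naive = naive, corrected = corrected)
```

Every corrected prediction exceeds its naive counterpart by the constant factor `exp(sigma2 / 2)`, which is why the correction matters more when the residual variance is large.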

Solution:

(a)
> ggplot(data = BIOMASS, aes(x = diameter, y = aboveweight)) +
+   geom_point() +
+   stat_smooth(method = "lm", se = FALSE) +
+   theme_bw()

[Figure: scatterplot of aboveweight versus diameter with a superimposed least-squares line]

The association between aboveweight and diameter is positive; the general form of the relationship is slightly curvilinear.

(b)
> ggplot(data = BIOMASS, aes(x = log(diameter), y = log(aboveweight))) +
+   geom_point() +
+   stat_smooth(method = "lm", se = FALSE) +
+   theme_bw()

[Figure: scatterplot of log(aboveweight) versus log(diameter) with a superimposed least-squares line]

(c)
> modlog <- lm(log(aboveweight) ~ log(diameter), data = BIOMASS)
> summary(modlog)

Call:
lm(formula = log(aboveweight) ~ log(diameter), data = BIOMASS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.48510 -0.12682  0.02701  0.10766  0.32104

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1.5015     0.1920  -7.822 1.38e-09 ***
log(diameter)   2.2806     0.0542  42.076  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1842 on 40 degrees of freedom
Multiple R-squared:  0.9779, Adjusted R-squared:  0.9774
F-statistic:  1770 on 1 and 40 DF,  p-value: < 2.2e-16

r2 > > >

r2
> vcov(modlogH)
                (Intercept) log(diameter)  log(height)
(Intercept)    0.1022432727 -0.0006066342 -0.032216921
log(diameter) -0.0006066342  0.0024646465 -0.002596642
log(height)   -0.0322169214 -0.0025966423  0.013365634

(3)
> CI <- confint(modlogH)
> CI
                   2.5 %     97.5 %
(Intercept)   -3.4238270 -2.1302958
log(diameter)  2.0773698  2.2782036
log(height)    0.2953374  0.7630233

The 95% confidence interval for β1 is [2.0774, 2.2782], and the 95% confidence interval for β2 is [0.2953, 0.7630].

(4)
> summary(modlogH)

Call:
lm(formula = log(aboveweight) ~ log(diameter) + log(height), data = BIOMASS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.26519 -0.11243 -0.01637  0.07720  0.38024

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.77706    0.31976  -8.685 1.18e-10 ***
log(diameter)  2.17779    0.04965  43.867  < 2e-16 ***
log(height)    0.52918    0.11561   4.577 4.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1505 on 39 degrees of freedom
Multiple R-squared:  0.9856, Adjusted R-squared:  0.9849
F-statistic:  1337 on 2 and 39 DF,  p-value: < 2.2e-16


r2 hcv hcv

# 2*p/n # hii CV

[1] 0.1702128
> which(hatvalues(modelD) > hcv)
10 22 45
10 22 45

There are three observations (10, 22, and 45) with a leverage value that exceeds 0.1702.

(vii)
> outlierTest(modelD)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferonni p
46 -3.030668          0.0041662      0.19581

There are no outliers according to a Bonferroni test.

(viii)

DF
> CI <- confint(modelD)
> CI
                     2.5 %      97.5 %
(Intercept)  -2012.6283481 2879.394461
fruit            0.7542251    1.091433
smallareaR67 -2116.2680000 3432.813298
smallareaR68 -5746.3330265  334.455286

The 95% confidence interval for the fruit coefficient is [0.7542, 1.0914].

(h)
> coef(summary(modelD))
                  Estimate   Std. Error    t value     Pr(>|t|)
(Intercept)    433.3830564 1.212883e+03  0.3573165 7.226027e-01
fruit            0.9228291 8.360421e-02 11.0380687 3.960570e-14
smallareaR67   658.2726489 1.375788e+03  0.4784696 6.347399e-01
smallareaR68 -2705.9388703 1.507614e+03 -1.7948481 7.970919e-02
> IOF <- 10000 * coef(summary(modelD))["fruit", "Estimate"]
> IOF
[1] 9228.291

Holding all other quantities in the model constant, an increase in fruit of 10,000 m² would increase the expected observed fruits by 9228.2905 m².

(i)
> newdata PredictEstimate PredictEstimate
         1          2          3
  89988.66 4503208.64 2658694.17

The predicted areas of fruit trees for small areas R63, R67, and R68 are 89988.6641 m², 4503208.6404 m², and 2658694.1669 m², respectively.

(j) The function ggplot() is used with the aesthetic color = smallarea to distinguish the small areas in the plot. The regression lines are nearly parallel, suggesting that a model with an identical slope but different intercepts for each small area may be reasonable.

> ggplot(data = SATFRUIT, aes(x = fruit, y = observed, color = smallarea)) +
+   geom_point() +
+   stat_smooth(method = "lm", se = FALSE) +
+   theme_bw()

[Figure: scatterplot of observed versus fruit, colored by smallarea (R63, R67, R68), with a fitted least-squares line for each small area]

(k)
> ggplot(data = DF, aes(x = observed, y = .fitted, color = smallarea)) +
+   geom_point() +
+   theme_bw() +
+   geom_abline(intercept = 0, slope = 1, lty = "dashed") +
+   labs(x = "Observed", y = "Fitted")

[Figure: scatterplot of Fitted versus Observed values, colored by smallarea (R63, R67, R68), with a dashed line of intercept 0 and slope 1]

A straight line appears to model the relationship between fitted and observed values.

(l) Recall that the direct technique estimates the total surface area by multiplying the mean of the observed surface area in the sampled segments by the total number of segments in every small area. The direct and model estimates, initially in m², are converted to hectares by dividing each estimate by 10,000.

> DirectEstimate DirectEstimate
      R63       R67       R68
 198466.5 5867470.0 3589159.5


newdata > >

mod.glm > >

ggplot(data = VIT2005, aes(x = totalprice)) + geom_density(fill = "pink") + theme_bw() MD scatterplotMatrix(~totalprice + toilets + garage + elevator + + storage, data = VIT2005)

[Figure: scatterplot matrix of totalprice, area, age, floor, and rooms]

[Figure: scatterplot matrix of totalprice, toilets, garage, elevator, and storage]

The variable totalprice appears to have a moderate linear relationship with area.

(c)
> NUM COR COR
          area        age      floor    rooms   toilets    garage  elevator   storage
[1,] 0.8092125 -0.2724497 0.02921993 0.525627 0.6875706 0.5237425 0.5109393 0.2673579

The highest three correlations with totalprice occur with area (0.8092), toilets (0.6876), and rooms (0.5256).

Model (A)

The functions drop1() and update() are used to create a model using backward elimination.
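The manual steps with drop1() and update() can be wrapped in a loop. The following is a minimal sketch under assumptions, not the code used in the solution: it uses the built-in `mtcars` data, an arbitrary starting model, and a hypothetical cutoff `alpha = 0.05`, and at each pass it removes the term with the largest F-test p-value until every remaining term is significant.

```r
# Illustrative backward elimination with drop1() and update().
# The data (mtcars), the starting model, and alpha are assumptions for this sketch.
alpha <- 0.05
model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

repeat {
  tab      <- drop1(model, test = "F")     # single-term deletion table
  pvals    <- tab[["Pr(>F)"]][-1]          # drop the <none> row
  terms_in <- rownames(tab)[-1]
  # Stop when no terms are left or all remaining terms are significant
  if (length(pvals) == 0 || max(pvals) <= alpha) break
  worst <- terms_in[which.max(pvals)]      # least significant term
  model <- update(model, as.formula(paste(". ~ . -", worst)))
}

formula(model)
drop1(model, test = "F")
```

Removing one term per pass matters because the p-values of the remaining terms change after each deletion, which is also why the solution reruns drop1() after every update().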


> model.be <- lm(totalprice ~ area + zone + category + age + floor +
+                  rooms + out + conservation + toilets + garage + elevator +
+                  streetcategory + heating + storage, data = VIT2005)
> drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + age + floor + rooms + out +
    conservation + toilets + garage + elevator + streetcategory +
    heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.5891e+10 4412.6
area            1 4.4519e+10 1.3041e+11 4501.7 87.5972 < 2.2e-16 ***
zone           22 1.1171e+11 1.9760e+11 4550.3  9.9910 < 2.2e-16 ***
category        6 9.5199e+09 9.5411e+10 4423.5  3.1219  0.006303 **
age             1 3.7563e+06 8.5895e+10 4410.6  0.0074  0.931591
floor           1 3.8440e+07 8.5929e+10 4410.7  0.0756  0.783639
rooms           1 1.0656e+09 8.6956e+10 4413.3  2.0967  0.149472
out             3 3.8946e+09 8.9785e+10 4416.3  2.5544  0.057135 .
conservation    3 1.0031e+09 8.6894e+10 4409.2  0.6579  0.579069
toilets         1 4.7971e+09 9.0688e+10 4422.5  9.4389  0.002477 **
garage          1 1.4771e+10 1.0066e+11 4445.2 29.0627 2.328e-07 ***
elevator        1 5.4265e+09 9.1317e+10 4424.0 10.6772  0.001314 **
streetcategory  3 3.5550e+09 8.9446e+10 4415.5  2.3316  0.076019 .
heating         3 4.3202e+09 9.0211e+10 4417.3  2.8335  0.039877 *
storage         1 1.6433e+09 8.7534e+10 4414.8  3.2334  0.073937 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be <- update(model.be, . ~ . - age)
> drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + floor + rooms + out + conservation +
    toilets + garage + elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.5895e+10 4410.6
area            1 4.4894e+10 1.3079e+11 4500.3 88.8523 < 2.2e-16 ***
zone           22 1.1220e+11 1.9810e+11 4548.8 10.0942 < 2.2e-16 ***
category        6 9.6605e+09 9.5555e+10 4421.9  3.1866 0.0054598 **
floor           1 3.6864e+07 8.5931e+10 4408.7  0.0730 0.7874016
rooms           1 1.0655e+09 8.6960e+10 4411.3  2.1088 0.1482969
out             3 4.0439e+09 8.9938e+10 4414.7  2.6678 0.0493562 *
conservation    3 1.3497e+09 8.7244e+10 4408.0  0.8905 0.4473466
toilets         1 4.7993e+09 9.0694e+10 4420.5  9.4987 0.0023994 **
garage          1 1.4767e+10 1.0066e+11 4443.2 29.2260 2.153e-07 ***
elevator        1 5.6740e+09 9.1569e+10 4422.6 11.2298 0.0009919 ***
streetcategory  3 3.5550e+09 8.9450e+10 4413.5  2.3453 0.0746761 .
heating         3 4.3716e+09 9.0266e+10 4415.5  2.8840 0.0373395 *
storage         1 1.6950e+09 8.7590e+10 4412.9  3.3547 0.0687625 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be <- update(model.be, . ~ . - floor)
> drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + rooms + out + conservation +
    toilets + garage + elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.5931e+10 4408.7
area            1 4.4857e+10 1.3079e+11 4498.3 89.2643 < 2.2e-16 ***
zone           22 1.1713e+11 2.0306e+11 4552.2 10.5949 < 2.2e-16 ***
category        6 9.8951e+09 9.5826e+10 4420.5  3.2818 0.0044204 **
rooms           1 1.0703e+09 8.7002e+10 4409.4  2.1299 0.1462887
out             3 4.0364e+09 8.9968e+10 4412.7  2.6774 0.0487312 *
conservation    3 1.3563e+09 8.7288e+10 4406.1  0.8997 0.4426615
toilets         1 4.8110e+09 9.0742e+10 4418.6  9.5738 0.0023062 **
garage          1 1.4733e+10 1.0066e+11 4441.2 29.3185 2.054e-07 ***
elevator        1 5.7376e+09 9.1669e+10 4420.8 11.4175 0.0009011 ***
streetcategory  3 3.5188e+09 8.9450e+10 4411.5  2.3341 0.0757310 .
heating         3 4.4146e+09 9.0346e+10 4413.6  2.9283 0.0352446 *
storage         1 1.6588e+09 8.7590e+10 4410.9  3.3010 0.0709881 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be <- update(model.be, . ~ . - conservation)
> drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + rooms + out + toilets + garage +
    elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.7288e+10 4406.1
area            1 4.4067e+10 1.3135e+11 4493.2 87.8431 < 2.2e-16 ***
zone           22 1.1785e+11 2.0514e+11 4548.4 10.6787 < 2.2e-16 ***
category        6 1.2678e+10 9.9966e+10 4423.7  4.2122 0.0005529 ***
rooms           1 1.0246e+09 8.8312e+10 4406.7  2.0425 0.1547526
out             3 4.5287e+09 9.1816e+10 4411.2  3.0092 0.0316923 *
toilets         1 5.1432e+09 9.2431e+10 4416.6 10.2525 0.0016231 **
garage          1 1.5621e+10 1.0291e+11 4440.0 31.1392  9.06e-08 ***
elevator        1 5.7882e+09 9.3076e+10 4418.1 11.5382 0.0008449 ***
streetcategory  3 3.5484e+09 9.0836e+10 4408.8  2.3578 0.0734020 .
heating         3 3.9987e+09 9.1286e+10 4409.9  2.6570 0.0499673 *
storage         1 1.6695e+09 8.8957e+10 4408.3  3.3280 0.0698262 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be <- update(model.be, . ~ . - rooms)
> drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC  F value    Pr(>F)
<none>                       8.8312e+10 4406.7
area            1 6.3300e+10 1.5161e+11 4522.5 125.4358 < 2.2e-16 ***
zone           22 1.1695e+11 2.0526e+11 4546.5  10.5341 < 2.2e-16 ***
category        6 1.2113e+10 1.0043e+11 4422.7   4.0004 0.0008860 ***
out             3 4.8644e+09 9.3177e+10 4412.4   3.2131 0.0243109 *
toilets         1 5.4584e+09 9.3771e+10 4417.8  10.8163 0.0012163 **
garage          1 1.5751e+10 1.0406e+11 4440.5  31.2116 8.718e-08 ***
elevator        1 6.3078e+09 9.4620e+10 4419.7  12.4996 0.0005209 ***
streetcategory  3 3.3915e+09 9.1704e+10 4408.9   2.2402 0.0852905 .
heating         3 4.1435e+09 9.2456e+10 4410.7   2.7369 0.0450560 *
storage         1 1.5008e+09 8.9813e+10 4408.4   2.9740 0.0863783 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be <- update(model.be, . ~ . - storage)
> drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating
               Df  Sum of Sq        RSS    AIC  F value    Pr(>F)
<none>                       8.9813e+10 4408.4
area            1 6.4458e+10 1.5427e+11 4524.3 126.3141 < 2.2e-16 ***
zone           22 1.1977e+11 2.0959e+11 4549.1  10.6688 < 2.2e-16 ***
category        6 1.1488e+10 1.0130e+11 4422.6   3.7521  0.001540 **
out             3 4.9656e+09 9.4779e+10 4414.1   3.2436  0.023353 *
toilets         1 5.6974e+09 9.5511e+10 4419.8  11.1647  0.001018 **
garage          1 1.6124e+10 1.0594e+11 4442.4  31.5962  7.32e-08 ***
elevator        1 6.8119e+09 9.6625e+10 4422.3  13.3488  0.000341 ***
streetcategory  3 4.5260e+09 9.4339e+10 4413.1   2.9564  0.033902 *
heating         3 4.2935e+09 9.4107e+10 4412.5   2.8046  0.041267 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> formula(model.be)
totalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating
> modelA

modelAg >

set.seed(5) cv.error5 >

mgof >

library(MASS) SCOPE + > > >

modelC
> SCOPE <- (~ area + zone + category + age + floor + rooms + out +
+             conservation + toilets + garage + elevator + streetcategory +
+             heating + storage)
> mod.fs <- lm(totalprice ~ 1, data = VIT2005)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ 1
               Df  Sum of Sq        RSS    AIC  F value    Pr(>F)
<none>                       1.0421e+12 4860.7
area            1 6.8239e+11 3.5970e+11 4630.8 409.7694 < 2.2e-16 ***
zone           22 6.2786e+11 4.1424e+11 4703.6  13.4346 < 2.2e-16 ***
category        6 3.8142e+11 6.6068e+11 4773.4  20.3020 < 2.2e-16 ***
age             1 7.7353e+10 9.6474e+11 4845.9  17.3190 4.563e-05 ***
floor           1 8.8974e+08 1.0412e+12 4862.5   0.1846 0.6678953
rooms           1 2.8791e+11 7.5418e+11 4792.2  82.4595 < 2.2e-16 ***
out             3 1.5831e+10 1.0263e+12 4863.4   1.1004 0.3499449
conservation    3 9.4240e+10 9.4785e+11 4846.1   7.0923 0.0001449 ***
toilets         1 4.9265e+11 5.4944e+11 4723.2 193.6754 < 2.2e-16 ***
garage          1 2.8585e+11 7.5624e+11 4792.8  81.6462 < 2.2e-16 ***
elevator        1 2.7205e+11 7.7005e+11 4796.8  76.3102 6.764e-16 ***
streetcategory  3 1.2246e+11 9.1963e+11 4839.5   9.4988 6.440e-06 ***
heating         3 1.6150e+11 8.8059e+11 4830.0  13.0827 7.067e-08 ***
storage         1 7.4489e+10 9.6760e+11 4846.6  16.6283 6.393e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + area)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       3.5970e+11 4630.8
zone           22 1.7960e+11 1.8010e+11 4524.0  8.7936 < 2.2e-16 ***
category        6 9.3970e+10 2.6573e+11 4576.8 12.3768 6.363e-12 ***
age             1 5.5354e+10 3.0435e+11 4596.4 39.1032 2.137e-09 ***
floor           1 1.4251e+09 3.5828e+11 4632.0  0.8552 0.3561201
rooms           1 1.4929e+08 3.5956e+11 4632.8  0.0893 0.7653963
out             3 5.1247e+09 3.5458e+11 4633.7  1.0261 0.3819282
conservation    3 3.6835e+10 3.2287e+11 4613.3  8.1001 3.915e-05 ***
toilets         1 5.6364e+10 3.0334e+11 4595.7 39.9496 1.482e-09 ***
garage          1 6.8073e+10 2.9163e+11 4587.1 50.1857 1.968e-11 ***
elevator        1 4.5308e+10 3.1440e+11 4603.5 30.9839 7.706e-08 ***
streetcategory  3 1.5886e+09 3.5812e+11 4635.9  0.3149 0.8145660
heating         3 2.5654e+10 3.3405e+11 4620.7  5.4526 0.0012481 **
storage         1 2.2452e+10 3.3725e+11 4618.8 14.3132 0.0002007 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + zone)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       1.8010e+11 4524.0
category        6 3.7898e+10 1.4221e+11 4484.5  8.3504 4.949e-08 ***
age             1 1.1794e+10 1.6831e+11 4511.3 13.5237 0.0003053 ***
floor           1 2.1117e+08 1.7989e+11 4525.8  0.2266 0.6346290
rooms           1 2.4153e+09 1.7769e+11 4523.1  2.6235 0.1069270
out             3 1.0513e+09 1.7905e+11 4528.8  0.3738 0.7719787
conservation    3 1.5083e+10 1.6502e+11 4511.0  5.8192 0.0007961 ***
toilets         1 2.8406e+10 1.5170e+11 4488.6 36.1406 9.014e-09 ***
garage          1 3.2856e+10 1.4725e+11 4482.1 43.0642 4.758e-10 ***
elevator        1 2.0964e+10 1.5914e+11 4499.1 25.4251 1.056e-06 ***
streetcategory  3 7.0112e+09 1.7309e+11 4521.4  2.5789 0.0549525 .
heating         3 1.2792e+10 1.6731e+11 4514.0  4.8678 0.0027611 **
storage         1 7.0505e+09 1.7305e+11 4517.3  7.8632 0.0055610 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + garage)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       1.4725e+11 4482.1
category        6 2.6933e+10 1.2032e+11 4450.1  6.9769 1.040e-06 ***
age             1 8.0663e+09 1.3918e+11 4471.9 11.1273 0.0010212 **
floor           1 2.8860e+08 1.4696e+11 4483.7  0.3771 0.5399126
rooms           1 2.2030e+09 1.4505e+11 4480.8  2.9162 0.0893099 .
out             3 2.0387e+09 1.4521e+11 4485.1  0.8892 0.4477641
conservation    3 9.5253e+09 1.3772e+11 4473.6  4.3803 0.0052350 **
toilets         1 1.6516e+10 1.3073e+11 4458.2 24.2567 1.813e-06 ***
elevator        1 1.6358e+10 1.3089e+11 4458.5 23.9950 2.046e-06 ***
streetcategory  3 6.5779e+09 1.4067e+11 4478.2  2.9616 0.0334679 *
heating         3 1.2552e+10 1.3470e+11 4468.7  5.9016 0.0007162 ***
storage         1 5.1715e+09 1.4208e+11 4476.3  6.9886 0.0088808 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + category)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       1.2032e+11 4450.1
age             1 1.7261e+09 1.1859e+11 4448.9  2.7073  0.101580
floor           1 9.2505e+05 1.2031e+11 4452.1  0.0014  0.969875
rooms           1 2.5576e+09 1.1776e+11 4447.4  4.0398  0.045883 *
out             3 5.6366e+09 1.1468e+11 4445.6  3.0146  0.031318 *
conservation    3 1.5615e+09 1.1875e+11 4453.2  0.8065  0.491749
toilets         1 6.6314e+09 1.1368e+11 4439.7 10.8497  0.001183 **
elevator        1 1.0192e+10 1.1012e+11 4432.8 17.2148 5.074e-05 ***
streetcategory  3 5.2963e+09 1.1502e+11 4446.3  2.8242  0.040097 *
heating         3 7.3042e+09 1.1301e+11 4442.4  3.9641  0.009074 **
storage         1 4.2321e+09 1.1608e+11 4444.3  6.7812  0.009957 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + elevator)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator
               Df  Sum of Sq        RSS    AIC F value   Pr(>F)
<none>                       1.1012e+11 4432.8
age             1  815586409 1.0931e+11 4433.2  1.3804 0.241550
floor           1   27424777 1.1010e+11 4434.7  0.0461 0.830261
rooms           1 1548114686 1.0857e+11 4431.7  2.6378 0.106049
out             3 4328922592 1.0579e+11 4430.1  2.4960 0.061298 .
conservation    3 1689461578 1.0843e+11 4435.4  0.9504 0.417449
toilets         1 6073243761 1.0405e+11 4422.4 10.7982 0.001216 **
streetcategory  3 5617271088 1.0451e+11 4427.4  3.2788 0.022218 *
heating         3 5056597855 1.0507e+11 4428.6  2.9358 0.034710 *
storage         1 3694847210 1.0643e+11 4427.4  6.4226 0.012097 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + toilets)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator + toilets
               Df  Sum of Sq        RSS    AIC F value  Pr(>F)
<none>                       1.0405e+11 4422.4
age             1  822533955 1.0323e+11 4422.7  1.4661 0.22751
floor           1   29736792 1.0402e+11 4424.4  0.0526 0.81885
rooms           1 1211766335 1.0284e+11 4421.9  2.1681 0.14261
out             3 4583711290 9.9466e+10 4418.6  2.7957 0.04164 *
conservation    3 1482925524 1.0257e+11 4425.3  0.8771 0.45405
streetcategory  3 5418285673 9.8631e+10 4416.8  3.3327 0.02072 *
heating         3 4760850186 9.9289e+10 4418.2  2.9089 0.03596 *
storage         1 3309776001 1.0074e+11 4417.4  6.0453 0.01487 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + storage)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator + toilets +
    storage
               Df  Sum of Sq        RSS    AIC F value  Pr(>F)
<none>                       1.0074e+11 4417.4
age             1  372189171 1.0037e+11 4418.6  0.6786 0.41114
floor           1   21742002 1.0072e+11 4419.3  0.0395 0.84267
rooms           1 1463125127 9.9277e+10 4416.2  2.6970 0.10225
out             3 4307912248 9.6432e+10 4413.9  2.6953 0.04744 *
conservation    3 1367752167 9.9372e+10 4420.4  0.8304 0.47872
streetcategory  3 3975957762 9.6764e+10 4414.6  2.4791 0.06269 .
heating         3 4211586990 9.6528e+10 4414.1  2.6324 0.05145 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs <- update(mod.fs, . ~ . + out)
> add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator + toilets +
    storage + out
               Df  Sum of Sq        RSS    AIC F value  Pr(>F)
<none>                       9.6432e+10 4413.9
age             1   13949063 9.6418e+10 4415.8  0.0260 0.87198
floor           1   15157903 9.6417e+10 4415.8  0.0283 0.86660
rooms           1 1036260293 9.5396e+10 4413.5  1.9553 0.16374
conservation    3  866974136 9.5565e+10 4417.9  0.5383 0.65666
streetcategory  3 3976103810 9.2456e+10 4410.7  2.5517 0.05715 .
heating         3 4728065085 9.1704e+10 4408.9  3.0591 0.02964 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> formula(mod.fs)
totalprice ~ area + zone + garage + category + elevator + toilets +
    storage + out
> modelD

modelD

set.seed(5) cv.error5 > > >

> residualPlot(modelA, main = "Model A")
> residualPlot(modelB, main = "Model B")
> residualPlot(modelC, main = "Model C")
> residualPlot(modelD, main = "Model D")

[Figure: four panels (Model A, Model B, Model C, Model D) of Pearson residuals versus Fitted values]

The residuals versus the fitted values for Models (A), (B), (C), and (D) all show a definite curvature, indicating that none of the models is adequate.

(e)
> boxCox(modelA, lambda = seq(-0.5, 0.5, length = 200))
> boxCox(modelB, lambda = seq(-0.5, 0.5, length = 200))
> boxCox(modelC, lambda = seq(-0.5, 0.5, length = 200))
> boxCox(modelD, lambda = seq(-0.5, 0.5, length = 200))
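To make the reading of such a plot concrete, the sketch below extracts the maximizing λ and an approximate 95% interval numerically. It rests on several assumptions: it substitutes boxcox() from the MASS package for the boxCox() function used above, the data are simulated so that the log scale is linear by construction, and the names `lambda_hat` and `ci` are hypothetical.

```r
# Illustrative Box-Cox profile on simulated data.
# MASS::boxcox() is used here in place of the boxCox() calls in the solution.
library(MASS)

set.seed(10)
x   <- runif(200, min = 50, max = 150)
y   <- exp(8 + 0.03 * x + rnorm(200, sd = 0.1))   # linear on the log scale
fit <- lm(y ~ x)

bc <- boxcox(fit, lambda = seq(-0.5, 0.5, length = 200), plotit = FALSE)

# lambda that maximizes the profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# Approximate 95% interval: lambdas within qchisq(0.95, 1) / 2 of the maximum
ci <- range(bc$x[bc$y > max(bc$y) - qchisq(0.95, df = 1) / 2])

lambda_hat
ci
```

A maximizing value near zero, with an interval that covers zero, is the pattern that points to a log transformation of the response.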

[Figure: four panels of Box-Cox profile log-likelihood versus λ, one for each of Models A, B, C, and D, each marking the 95% confidence interval for λ]

A log transformation is suggested for the response totalprice in each model.

Model (E)

The functions drop1() and update() are used to create a model using backward elimination.

> VIT2005$logtotalprice model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + age + floor + rooms +
    out + conservation + toilets + garage + elevator + streetcategory +
    heating + storage
               Df Sum of Sq     RSS      AIC F value    Pr(>F)
<none>                      0.97895 -1080.46
area            1   0.45808 1.43703  -998.78 79.0794 8.788e-16 ***
zone           22   1.25852 2.23747  -944.25  9.8756 < 2.2e-16 ***
category        6   0.10125 1.08020 -1071.00  2.9132  0.009938 **
age             1   0.00061 0.97956 -1082.32  0.1045  0.746909
floor           1   0.00185 0.98080 -1082.05  0.3191  0.572909
rooms           1   0.00422 0.98317 -1081.52  0.7279  0.394766
out             3   0.06721 1.04616 -1071.98  3.8675  0.010426 *
conservation    3   0.01134 0.99029 -1083.95  0.6523  0.582546
toilets         1   0.08460 1.06356 -1064.39 14.6057  0.000186 ***
garage          1   0.14304 1.12199 -1052.73 24.6935 1.636e-06 ***
elevator        1   0.14529 1.12424 -1052.29 25.0826 1.373e-06 ***
streetcategory  3   0.03040 1.00935 -1079.79  1.7492  0.158880
heating         3   0.04007 1.01902 -1077.71  2.3060  0.078550 .
storage         1   0.03280 1.01175 -1075.27  5.6625  0.018445 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
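The drop1()/update() cycle can be automated. The following is a minimal sketch, assuming a simple rule (remove the term with the largest partial F-test p-value until every remaining p-value is at most alpha); it runs on the built-in mtcars data rather than VIT2005, and the function name backward_eliminate and the cutoff alpha = 0.05 are illustrative choices, not taken from the manual. The steps printed in this solution need not follow this rule at every stage.

```r
# Backward elimination by partial F-tests (illustrative rule, alpha = 0.05)
backward_eliminate <- function(model, alpha = 0.05) {
  repeat {
    tab <- drop1(model, test = "F")
    pvals <- tab[["Pr(>F)"]][-1]          # skip the <none> row
    terms_in <- rownames(tab)[-1]
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- terms_in[which.max(pvals)]   # least significant term
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

full <- lm(mpg ~ wt + hp + qsec + drat + factor(am), data = mtcars)
reduced <- backward_eliminate(full)
print(formula(reduced))
print(drop1(reduced, test = "F"))
```

Using update() with a formula of the form `. ~ . - term` keeps the response and data unchanged, which mirrors the way model.be is refitted between successive drop1() tables.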

> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + floor + rooms + out +
    conservation + toilets + garage + elevator + streetcategory +
    heating + storage
               Df Sum of Sq     RSS      AIC F value    Pr(>F)
<none>                      0.97956 -1082.32
area            1   0.46618 1.44574  -999.46 80.9051 4.535e-16 ***
zone           22   1.26736 2.24692  -945.34  9.9977 < 2.2e-16 ***
category        6   0.10075 1.08031 -1072.98  2.9142 0.0099017 **
floor           1   0.00202 0.98158 -1083.87  0.3509 0.5543866
rooms           1   0.00422 0.98378 -1083.39  0.7322 0.3933744
out             3   0.06702 1.04658 -1073.90  3.8772 0.0102869 *
conservation    3   0.01128 0.99084 -1085.83  0.6525 0.5824434
toilets         1   0.08451 1.06406 -1066.28 14.6657 0.0001803 ***
garage          1   0.14340 1.12296 -1054.54 24.8869 1.492e-06 ***
elevator        1   0.14688 1.12643 -1053.87 25.4902 1.136e-06 ***
streetcategory  3   0.03151 1.01107 -1081.42  1.8229 0.1448410
heating         3   0.04185 1.02140 -1079.20  2.4207 0.0678034 .
storage         1   0.03220 1.01175 -1077.27  5.5875 0.0192183 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + floor + rooms + out +
    toilets + garage + elevator + streetcategory + heating + storage
               Df Sum of Sq     RSS     AIC F value    Pr(>F)
<none>                      0.99084 -1085.8
area            1   0.46386 1.45470 -1004.1 80.9902 3.978e-16 ***
zone           22   1.27513 2.26596  -949.5 10.1199 < 2.2e-16 ***
category        6   0.12795 1.11878 -1071.3  3.7233 0.0016539 **
floor           1   0.00212 0.99295 -1087.4  0.3700 0.5437901
rooms           1   0.00384 0.99468 -1087.0  0.6706 0.4139769
out             3   0.07314 1.06397 -1076.3  4.2565 0.0062592 **
toilets         1   0.08930 1.08013 -1069.0 15.5913 0.0001142 ***
garage          1   0.15229 1.14313 -1056.7 26.5903 6.824e-07 ***
elevator        1   0.14955 1.14039 -1057.2 26.1122 8.459e-07 ***
streetcategory  3   0.03302 1.02385 -1084.7  1.9216 0.1278696
heating         3   0.03941 1.03025 -1083.3  2.2939 0.0796834 .
storage         1   0.03105 1.02189 -1081.1  5.4221 0.0210400 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + rooms + out + toilets +
    garage + elevator + streetcategory + heating + storage
               Df Sum of Sq     RSS      AIC F value    Pr(>F)
<none>                      0.99295 -1087.36
area            1   0.46227 1.45522 -1006.04 81.0051 3.828e-16 ***
zone           22   1.32477 2.31772  -946.57 10.5521 < 2.2e-16 ***
category        6   0.13186 1.12482 -1072.18  3.8512 0.0012404 **
rooms           1   0.00390 0.99685 -1088.51  0.6826 0.4098210
out             3   0.07320 1.06615 -1077.86  4.2757 0.0060983 **
toilets         1   0.08971 1.08266 -1070.51 15.7202 0.0001071 ***
garage          1   0.15125 1.14420 -1058.45 26.5036 7.058e-07 ***
elevator        1   0.15179 1.14474 -1058.35 26.5984 6.764e-07 ***
streetcategory  3   0.03140 1.02435 -1086.57  1.8340 0.1427479
heating         3   0.04104 1.03400 -1084.53  2.3975 0.0697654 .
storage         1   0.02921 1.02217 -1083.04  5.1190 0.0249012 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating + storage
               Df Sum of Sq     RSS      AIC  F value    Pr(>F)
<none>                      0.99685 -1088.51
area            1   0.63037 1.62722  -983.68 110.6630 < 2.2e-16 ***
zone           22   1.32105 2.31790  -948.56  10.5415 < 2.2e-16 ***
category        6   0.12882 1.12567 -1074.01   3.7690  0.001487 **
out             3   0.07696 1.07381 -1078.30   4.5036  0.004526 **
toilets         1   0.09245 1.08930 -1071.17  16.2297 8.353e-05 ***
garage          1   0.15205 1.14890 -1059.56  26.6921 6.452e-07 ***
elevator        1   0.15775 1.15460 -1058.48  27.6935 4.121e-07 ***
streetcategory  3   0.03042 1.02727 -1087.95   1.7802  0.152691
heating         3   0.04262 1.03947 -1085.38   2.4940  0.061618 .
storage         1   0.02789 1.02474 -1084.49   4.8962  0.028210 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + storage
               Df Sum of Sq    RSS      AIC  F value    Pr(>F)
<none>                      1.0395 -1085.38
area            1   0.67195 1.7114  -978.68 115.0658 < 2.2e-16 ***
zone           22   1.29885 2.3383  -952.64  10.1099 < 2.2e-16 ***
category        6   0.15896 1.1984 -1066.36   4.5369 0.0002631 ***
out             3   0.07346 1.1129 -1076.50   4.1930 0.0067656 **
toilets         1   0.09746 1.1369 -1067.84  16.6896 6.648e-05 ***
garage          1   0.14218 1.1817 -1059.43  24.3479 1.838e-06 ***
elevator        1   0.19646 1.2359 -1049.64  33.6422 2.967e-08 ***
streetcategory  3   0.03627 1.0757 -1083.90   2.0703 0.1058248
storage         1   0.03132 1.0708 -1080.91   5.3633 0.0217043 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> formula(model.be)
logtotalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + storage
> modelE modelEg cv.errorN CVNe CVNe
[1] 0.007113846
> set.seed(5)


> cv.error5 CV5e CV5e
[1] 0.007780027

For Model (E), CV_n = 0.0071 and CV_5 = 0.0078.

(ii)

> MGOF MGOF
           R2        R2.adj           AIC           BIC            SE
   0.91673861    0.89849594 -464.72399549 -325.95969791    0.07641801
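The five measures in this output can be obtained from any lm fit with base R. The sketch below is an assumption about how such a summary can be assembled, not the manual's MGOF code; it uses the built-in mtcars data, and the names fit, gof, and aic_gap are illustrative. It also shows why the AIC reported here differs from the AIC column of the drop1() tables: drop1() relies on extractAIC(), which omits an additive constant.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)
n <- nobs(fit)

# Goodness-of-fit measures in the same order as the output above
gof <- c(R2 = s$r.squared, R2.adj = s$adj.r.squared,
         AIC = AIC(fit), BIC = BIC(fit), SE = s$sigma)
print(gof)

# AIC() includes the constant n * (1 + log(2 * pi)) + 2 that extractAIC() leaves out
aic_gap <- AIC(fit) - extractAIC(fit)[2]
print(all.equal(aic_gap, n * (1 + log(2 * pi)) + 2))
```

Assuming n = 218 observations, the constant is about 620.66, which matches the gap between the value −1085.38 in the final drop1() table for Model (E) and the AIC of −464.72 reported here.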

The R^2, adjusted R^2, AIC, BIC, and standard error for modelE are 0.9167, 0.8985, -464.724, -325.9597, and 0.0764, respectively. The total proportion of variability explained by modelE is 0.9167.

Model (F)

The function stepAIC() from the MASS package is used to find a model using the AIC criterion.

> SCOPE mod.fs modelF formula(modelF)


logtotalprice ~ area + zone + elevator + toilets + garage + category +
    out + storage + heating + streetcategory

The AIC criterion suggests a model with variables area, zone, elevator, toilets, garage, category, out, storage, heating, and streetcategory.

(i) > > > >

modelFg >

set.seed(5) cv.error5 SCOPE mod.fs modelG formula(modelG)
logtotalprice ~ area + elevator + garage + zone + toilets + storage

The BIC criterion suggests a model with variables area, elevator, garage, zone, toilets, and storage.

(i) > > > >

modelGg >

set.seed(5) cv.error5 SCOPE mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ 1
               Df Sum of Sq     RSS     AIC  F value    Pr(>F)
<none>                      12.4844 -621.48
area            1    7.8362  4.6482 -834.87 364.1448 < 2.2e-16 ***
zone           22    7.5626  4.9218 -780.40  13.6196 < 2.2e-16 ***
category        6    4.8773  7.6071 -717.48  22.5468 < 2.2e-16 ***
age             1    1.2432 11.2412 -642.35  23.8875 1.990e-06 ***
floor           1    0.0307 12.4537 -620.02   0.5324    0.4664
rooms           1    3.3932  9.0912 -688.63  80.6198 < 2.2e-16 ***
out             3    0.1954 12.2890 -618.92   1.1344    0.3361
conservation    3    1.3130 11.1715 -639.71   8.3836 2.705e-05 ***
toilets         1    6.1982  6.2862 -769.06 212.9786 < 2.2e-16 ***
garage          1    3.3694  9.1150 -688.06  79.8464 < 2.2e-16 ***
elevator        1    4.0709  8.4135 -705.52 104.5141 < 2.2e-16 ***
streetcategory  3    1.3194 11.1650 -639.83   8.4299 2.548e-05 ***
heating         3    2.0692 10.4152 -654.99  14.1722 1.850e-08 ***
storage         1    1.0175 11.4669 -638.02  19.1668 1.866e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area
               Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>                      4.6482 -834.87
zone           22   2.38890 2.2593 -948.14  9.3240 < 2.2e-16 ***
category        6   1.39298 3.2552 -900.52 14.9772 3.035e-14 ***
age             1   0.94091 3.7073 -882.17 54.5666 3.274e-12 ***
floor           1   0.00287 4.6453 -833.00  0.1327 0.7159579
rooms           1   0.00522 4.6430 -833.11  0.2418 0.6234242
out             3   0.06955 4.5787 -832.15  1.0785 0.3591391
conservation    3   0.56107 4.0871 -856.91  9.7466 4.708e-06 ***
toilets         1   0.89664 3.7516 -879.59 51.3859 1.200e-11 ***
garage          1   0.82734 3.8209 -875.60 46.5544 8.921e-11 ***
elevator        1   0.98718 3.6610 -884.91 57.9741 8.291e-13 ***
streetcategory  3   0.01528 4.6329 -829.58  0.2342 0.8724910
heating         3   0.41740 4.2308 -849.38  7.0048 0.0001628 ***
storage         1   0.35116 4.2970 -849.99 17.5699 4.046e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone
               Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>                      2.2593 -948.14
category        6   0.49734 1.7620 -990.34  8.8444 1.685e-08 ***
age             1   0.15632 2.1030 -961.77 14.3460 0.0002031 ***
floor           1   0.00580 2.2535 -946.70  0.4964 0.4819483
rooms           1   0.02241 2.2369 -948.31  1.9331 0.1660184
out             3   0.02187 2.2374 -944.26  0.6223 0.6014037
conservation    3   0.19686 2.0624 -962.01  6.0770 0.0005689 ***
toilets         1   0.40710 1.8522 -989.45 42.4204 6.230e-10 ***
garage          1   0.36212 1.8972 -984.22 36.8384 6.671e-09 ***
elevator        1   0.42018 1.8391 -991.00 44.0942 3.098e-10 ***
streetcategory  3   0.06860 2.1907 -948.86  1.9937 0.1163357
heating         3   0.17104 2.0883 -959.30  5.2147 0.0017533 **
storage         1   0.11314 2.1462 -957.34 10.1740 0.0016624 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets
               Df Sum of Sq    RSS      AIC F value    Pr(>F)
<none>                      1.8522  -989.45
category        6   0.26912 1.5831 -1011.68  5.2984 4.544e-05 ***
age             1   0.10365 1.7486 -1000.00 11.3811 0.0008975 ***
floor           1   0.00599 1.8462  -988.16  0.6226 0.4310439
rooms           1   0.01137 1.8408  -988.79  1.1855 0.2776021
out             3   0.04104 1.8112  -988.33  1.4350 0.2339420
conservation    3   0.10036 1.7518  -995.59  3.6282 0.0140205 *
garage          1   0.21264 1.6396 -1014.03 24.9012 1.348e-06 ***
elevator        1   0.32835 1.5238 -1029.99 41.3709 9.772e-10 ***
streetcategory  3   0.05361 1.7986  -989.85  1.8878 0.1330509
heating         3   0.12343 1.7288  -998.49  4.5220 0.0043477 **
storage         1   0.08179 1.7704  -997.30  8.8701 0.0032727 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator
               Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>                      1.5238 -1030.0
category        6  0.178990 1.3449 -1045.2  4.1258 0.0006487 ***
age             1  0.048398 1.4754 -1035.0  6.2652 0.0131521 *
floor           1  0.001602 1.5223 -1028.2  0.2009 0.6544680
rooms           1  0.001585 1.5223 -1028.2  0.1989 0.6561194
out             3  0.037545 1.4863 -1029.4  1.5914 0.1928886
conservation    3  0.081656 1.4422 -1036.0  3.5670 0.0151983 *
garage          1  0.178658 1.3452 -1055.2 25.3672 1.093e-06 ***
streetcategory  3  0.062161 1.4617 -1033.1  2.6792 0.0482952 *
heating         3  0.092733 1.4311 -1037.7  4.0822 0.0077429 **
storage         1  0.070407 1.4534 -1038.3  9.2524 0.0026825 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator + garage
               Df Sum of Sq    RSS     AIC F value   Pr(>F)
<none>                      1.3452 -1055.2
category        6  0.142650 1.2025 -1067.6  3.6576 0.001865 **
age             1  0.036534 1.3087 -1059.2  5.3042 0.022357 *
floor           1  0.002271 1.3429 -1053.5  0.3213 0.571477
rooms           1  0.002035 1.3432 -1053.5  0.2879 0.592210
out             3  0.045758 1.2994 -1056.7  2.2067 0.088738 .
conservation    3  0.061414 1.2838 -1059.4  2.9979 0.031950 *
streetcategory  3  0.058889 1.2863 -1058.9  2.8690 0.037776 *
heating         3  0.094803 1.2504 -1065.1  4.7513 0.003227 **
storage         1  0.060346 1.2849 -1063.2  8.9239 0.003186 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator + garage + category
               Df Sum of Sq    RSS     AIC F value   Pr(>F)
<none>                      1.2025 -1067.6
age             1  0.006361 1.1962 -1066.8  0.9784 0.323890
floor           1  0.000015 1.2025 -1065.6  0.0024 0.961383
rooms           1  0.006518 1.1960 -1066.8  1.0028 0.317944
out             3  0.078212 1.1243 -1076.3  4.2202 0.006504 **
conservation    3  0.015357 1.1872 -1064.4  0.7847 0.503852
streetcategory  3  0.055147 1.1474 -1071.8  2.9158 0.035638 *
heating         3  0.053972 1.1486 -1071.6  2.8508 0.038772 *
storage         1  0.052809 1.1497 -1075.4  8.4514 0.004096 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator + garage + category +
    storage
               Df Sum of Sq    RSS     AIC F value   Pr(>F)
<none>                      1.1497 -1075.4
age             1  0.001743 1.1480 -1073.7  0.2779 0.598711
floor           1  0.002001 1.1477 -1073.8  0.3190 0.572911
rooms           1  0.008907 1.1408 -1075.1  1.4288 0.233502
out             3  0.073994 1.0757 -1083.9  4.1500 0.007136 **
conservation    3  0.015323 1.1344 -1072.3  0.8149 0.487130
streetcategory  3  0.036807 1.1129 -1076.5  1.9953 0.116308
heating         3  0.045544 1.1042 -1078.2  2.4886 0.061928 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> formula(mod.fs)
logtotalprice ~ area + zone + toilets + elevator + garage + category +
    storage

Forward selection selects the variables area, zone, toilets, elevator, garage, category, and storage.
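The add1()/update() cycle used for forward selection can be written as a loop. This is a minimal sketch under an assumed rule (add the candidate with the smallest partial F-test p-value while that p-value is below alpha); the function name forward_select, the cutoff alpha = 0.05, and the use of the built-in mtcars data are illustrative choices and are not taken from the manual, whose printed steps need not follow this rule exactly.

```r
# Forward selection by partial F-tests (illustrative rule, alpha = 0.05)
forward_select <- function(model, scope, alpha = 0.05) {
  repeat {
    # add1() stops with an error when nothing is left to add, so check first
    remaining <- setdiff(attr(terms(scope), "term.labels"),
                         attr(terms(model), "term.labels"))
    if (length(remaining) == 0) break
    tab <- add1(model, scope = scope, test = "F")
    pvals <- tab[["Pr(>F)"]][-1]            # skip the <none> row
    candidates <- rownames(tab)[-1]
    if (min(pvals) >= alpha) break
    best <- candidates[which.min(pvals)]    # most significant candidate
    model <- update(model, as.formula(paste(". ~ . +", best)))
  }
  model
}

null_fit <- lm(mpg ~ 1, data = mtcars)
scope <- ~ wt + hp + qsec + drat + disp
selected <- forward_select(null_fit, scope)
print(formula(selected))
```

Starting from the intercept-only model and passing the candidate terms as a one-sided formula mirrors the roles of mod.fs and SCOPE in the transcript above.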


(i) > > > > >

modelH

set.seed(5) cv.error5 CVNS CVNS
       CVNe        CVNf        CVNg        CVNh
0.007113846 0.007080105 0.007796288 0.007481615
> which.min(CVNS)
CVNf
   2
> CV5S CV5S
       CV5e        CV5f        CV5g        CV5h
0.007780027 0.007878634 0.008180095 0.008046479
> which.min(CV5S)
CV5e
   1
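The leave-one-out (CV_n) and five-fold (CV_5) errors compared above can be computed with cv.glm() from the boot package. The sketch below is an illustration on the built-in mtcars data, not the manual's code; the names fit, cv_n, cv_5, and cv_n_closed are assumptions. For a linear model the leave-one-out error also has a closed form based on the hat values, which gives a useful cross-check.

```r
library(boot)  # cv.glm(); boot ships with standard R installations

fit <- glm(mpg ~ wt + hp, data = mtcars)   # gaussian glm, equivalent to lm()

cv_n <- cv.glm(mtcars, fit)$delta[1]       # leave-one-out (K = n by default)

set.seed(5)                                # fold assignment is random for K < n
cv_5 <- cv.glm(mtcars, fit, K = 5)$delta[1]

# Closed form for leave-one-out in linear models: mean of (e_i / (1 - h_i))^2
e <- residuals(fit, type = "response")
cv_n_closed <- mean((e / (1 - hatvalues(fit)))^2)

print(c(cv_n = cv_n, cv_n_closed = cv_n_closed, cv_5 = cv_5))
```

The seed matters only for the five-fold estimate; the leave-one-out estimate is deterministic, which is why set.seed(5) appears before the cv.error5 computations in the transcript.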


Model (F) has the smallest CV_n = 0.0071, whereas Model (E) has the smallest CV_5 = 0.0078; Model (F) is a close second with CV_5 = 0.0079.

(g)

> residualPlots(modelF)

               Test stat Pr(>|t|)
area              -1.621    0.107
zone                  NA       NA
elevator          -0.393    0.695
toilets            1.785    0.076
garage             0.426    0.671
category              NA       NA
out                   NA       NA
storage            0.879    0.380
heating               NA       NA
streetcategory        NA       NA
Tukey test         0.495    0.621

[Figure: Pearson residuals for modelF plotted against area, zone, elevator, toilets, garage, category, out, storage, heating, streetcategory, and the fitted values.]

Assumptions with respect to the residuals seem to be satisfied with Model (F).

Model (I)

Model (F) is assigned to the object modelI.

> modelI influenceIndexPlot(modelI, id.n = 3)
> outlierTest(modelI)
   rstudent unadjusted p-value Bonferonni p
93 4.250659         3.4698e-05    0.0075643

[Figure: Diagnostic plots for modelI showing Cook's distance, Studentized residuals, Bonferroni p-values, and hat-values against the observation index, with the most extreme observations labeled.]

Observation 93 is an outlier according to the Bonferroni outlier test.

(ii)

> influencePlot(modelI, id.n = 3)
        StudRes       Hat      CookD
3   -3.23941807 0.1243814 0.18133476
9   -0.79653792 0.5268393 0.12831016
31  -1.86226419 0.3717995 0.21695582
92   0.02263487 0.6218219 0.00443887
93   4.25065863 0.2457917 0.35322493
156 -3.09043662 0.1530896 0.19565031
160  2.11225702 0.3116083 0.21460798
185  0.74383798 0.4394763 0.10057032
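The quantities behind the outlier test and the influence measures can be reproduced with base R. The sketch below is an illustration on the built-in mtcars data, not the manual's code: it assumes the test refers the externally Studentized residuals to a t distribution with one fewer degree of freedom than the residual degrees of freedom and multiplies the two-sided p-value by the number of observations. The names fit, rs, p_unadj, p_bonf, and cd are illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

rs <- rstudent(fit)                                  # externally Studentized residuals
p_unadj <- 2 * pt(abs(rs), df = df.residual(fit) - 1, lower.tail = FALSE)
p_bonf <- pmin(1, n * p_unadj)                       # Bonferroni adjustment

worst <- which.max(abs(rs))                          # most extreme residual
print(data.frame(rstudent = rs[worst],
                 unadjusted.p = p_unadj[worst],
                 bonferroni.p = p_bonf[worst]))

cd <- cooks.distance(fit)
print(head(sort(cd, decreasing = TRUE), 3))          # three most influential cases

# Arithmetic check on the reported row for observation 93, assuming n = 218
print(218 * 3.4698e-05)
```

Multiplying the reported unadjusted p-value for observation 93 by 218 reproduces the reported Bonferroni p-value up to rounding, which is consistent with the adjustment assumed here.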

[Figure: Studentized residuals versus hat-values for modelI, with observations 3, 9, 31, 92, 93, 156, 160, and 185 labeled.]

Observation 93 has the largest Cook's distance and is an influential observation.

(iii) Observations 3 and 93 are removed from Model (I).

> modelI checking.plots(modelI)
> shapiro.test(resid(modelI))

	Shapiro-Wilk normality test

data:  resid(modelI)
W = 0.99441, p-value = 0.6049

[Figure: checking.plots() output for modelI in four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 216, bandwidth = 0.3069). Observations 116, 156, and 207 are labeled.]

The normality assumption for the residuals appears to be satisfied.

(v)

> vif(modelI)
                     GVIF Df GVIF^(1/(2*Df))
area             3.007687  1        1.734268
zone           178.394245 22        1.125039
elevator         2.278083  1        1.509332
toilets          3.438421  1        1.854298
garage           1.829758  1        1.352686
category        13.719834  6        1.243882
out              2.384567  3        1.155850
storage          1.463063  1        1.209571
heating          5.433721  3        1.325917
streetcategory   7.124974  3        1.387173
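The generalized variance inflation factors in this table can be computed from the correlation matrix of the coefficient estimates. The sketch below is a base R illustration of that determinant formula on the built-in mtcars data; the function name gvif is hypothetical and is not the car package's vif(). For a term with one degree of freedom the GVIF reduces to the classical VIF, 1 / (1 - R^2), which the last lines check.

```r
gvif <- function(model) {
  V <- vcov(model)
  assign <- attr(model.matrix(model), "assign")
  keep <- assign != 0                         # drop the intercept column
  R <- cov2cor(V[keep, keep, drop = FALSE])   # correlations among coefficient estimates
  assign <- assign[keep]
  labels <- attr(terms(model), "term.labels")
  out <- t(sapply(seq_along(labels), function(j) {
    idx <- which(assign == j)                 # columns belonging to term j
    g <- det(R[idx, idx, drop = FALSE]) *
      det(R[-idx, -idx, drop = FALSE]) / det(R)
    c(GVIF = g, Df = length(idx), "GVIF^(1/(2*Df))" = g^(1 / (2 * length(idx))))
  }))
  rownames(out) <- labels
  out
}

fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
res <- gvif(fit)
print(res)

# Classical VIF for wt: regress wt on the other predictors
r2_wt <- summary(lm(wt ~ hp + factor(cyl), data = mtcars))$r.squared
print(c(gvif = unname(res["wt", "GVIF"]), classical = 1 / (1 - r2_wt)))
```

The third column puts terms with different degrees of freedom on a comparable scale; for a one-degree-of-freedom term such as area it is simply the square root of the GVIF (the square root of 3.007687 is 1.734268).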


Multicollinearity is not a problem with Model (I). The decrease in precision of estimation due to multicollinearity is by a factor of less than 1.855 for all variables.

(vi) The coefficients for Model (I) and 95% confidence intervals for the parameters of Model (I) are computed in R Code (12.4).

R Code 12.4

> coef(summary(modelI))
                      Estimate   Std. Error      t value      Pr(>|t|)
(Intercept)      11.7524914839 0.0778168104 151.02766898 1.300977e-185
area              0.0041496205 0.0003977202  10.43351652  4.449800e-20
zoneZ21           0.3802200972 0.0435551810   8.72961812  2.102581e-15
zoneZ31           0.3389852061 0.0450382345   7.52660955  2.739747e-12
zoneZ32           0.2403972342 0.0404942389   5.93657865  1.553304e-08
zoneZ34           0.1681132305 0.0503495797   3.33892024  1.029979e-03
zoneZ35           0.3393209999 0.0483376686   7.01980484  4.848350e-11
zoneZ36           0.2217815708 0.0409454210   5.41651704  2.010000e-07
zoneZ37           0.3585080961 0.0420635177   8.52301747  7.429520e-15
zoneZ38           0.1927022820 0.0510651531   3.77365523  2.207769e-04
zoneZ41           0.2707262908 0.0392701711   6.89394223  9.740508e-11
zoneZ42           0.3337865477 0.0499736267   6.67925403  3.152129e-10
zoneZ43           0.2292942872 0.0507979499   4.51384923  1.173102e-05
zoneZ44           0.1956903684 0.0487539533   4.01383591  8.884359e-05
zoneZ45           0.1490659068 0.0418746595   3.55981179  4.792959e-04
zoneZ46           0.1346932202 0.0441661971   3.04969024  2.651214e-03
zoneZ47           0.0632316000 0.0461406978   1.37040840  1.723350e-01
zoneZ48           0.2299657516 0.0454032057   5.06496729  1.039851e-06
zoneZ49           0.2015766920 0.0480320286   4.19671411  4.322100e-05
zoneZ52           0.1346092984 0.0424127766   3.17379123  1.781088e-03
zoneZ53           0.1659316744 0.0419921817   3.95148972  1.129719e-04
zoneZ56           0.2017329937 0.0475626099   4.24141977  3.611259e-05
zoneZ61           0.1586852151 0.0397582510   3.99125241  9.695293e-05
zoneZ62           0.1298976193 0.0405609255   3.20253095  1.621587e-03
elevator          0.1035564143 0.0179882299   5.75689851  3.825401e-08
toilets           0.0858677405 0.0176860476   4.85511192  2.676492e-06
garage            0.0759192863 0.0142635424   5.32261088  3.140212e-07
category2B       -0.0117734068 0.0456010371  -0.25818287  7.965727e-01
category3A       -0.0661178646 0.0444057487  -1.48894831  1.383217e-01
category3B       -0.0870663644 0.0456240182  -1.90834494  5.800261e-02
category4A       -0.1186452274 0.0473737485  -2.50445091  1.318946e-02
category4B       -0.1492007232 0.0496295095  -3.00629051  3.038333e-03
category5A       -0.1992746969 0.0779315318  -2.55704838  1.141526e-02
outE25            0.1707890114 0.0451573944   3.78208295  2.139872e-04
outE50           -0.0002797115 0.0125073847  -0.02236371  9.821836e-01
outE75            0.0407193190 0.0323878425   1.25724086  2.103611e-01
storage           0.0376288633 0.0143838096   2.61605683  9.681683e-03
heating3A        -0.0040182917 0.0373211302  -0.10766801  9.143838e-01
heating3B        -0.0097999797 0.0476691945  -0.20558308  8.373583e-01
heating4A         0.0347673493 0.0393575474   0.88337185  3.782613e-01
streetcategoryS3  0.0209412769 0.0181382026   1.15453980  2.498712e-01
streetcategoryS4  0.0203906494 0.0196300252   1.03874800  3.003716e-01
streetcategoryS5 -0.0187018119 0.0324744025  -0.57589395  5.654353e-01

> confint(modelI)
                        2.5 %       97.5 %
(Intercept)      11.598898894 11.906084074
area              0.003364612  0.004934629
zoneZ21           0.294252129  0.466188065
zoneZ31           0.250090030  0.427880382
zoneZ32           0.160470866  0.320323602
zoneZ34           0.068734673  0.267491788
zoneZ35           0.243913495  0.434728505
zoneZ36           0.140964672  0.302598469
zoneZ37           0.275484331  0.441531862
zoneZ38           0.091911346  0.293493218
zoneZ41           0.193215953  0.348236629
zoneZ42           0.235150036  0.432423060
zoneZ43           0.129030750  0.329557825
zoneZ44           0.099461213  0.291919524
zoneZ45           0.066414904  0.231716910
zoneZ46           0.047519246  0.221867194
zoneZ47          -0.027839587  0.154302787
zoneZ48           0.140350206  0.319581298
zoneZ49           0.106772451  0.296380933
zoneZ52           0.050896176  0.218322421
zoneZ53           0.083048710  0.248814639
zoneZ56           0.107855278  0.295610710
zoneZ61           0.080211519  0.237158911
zoneZ62           0.049839627  0.209955611
elevator          0.068051762  0.139061067
toilets           0.050959527  0.120775954
garage            0.047766315  0.104072258
category2B       -0.101779427  0.078232613
category3A       -0.153764659  0.021528929
category3B       -0.177117744  0.002985015
category4A       -0.212150174 -0.025140281
category4B       -0.247158027 -0.051243420
category5A       -0.353093721 -0.045455673
outE25            0.081658641  0.259919382
outE50           -0.024966429  0.024407006
outE75           -0.023206876  0.104645514
storage           0.009238512  0.066019214
heating3A        -0.077681669  0.069645085
heating3B        -0.103888069  0.084288110
heating4A        -0.042915450  0.112450148
streetcategoryS3 -0.014859388  0.056741941
streetcategoryS4 -0.018354532  0.059135831
streetcategoryS5 -0.082798857  0.045395233
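The limits returned by confint() for a linear model are the estimate plus or minus a t quantile times the standard error. The sketch below verifies this on the built-in mtcars data and then repeats the arithmetic for the area coefficient using the estimate and standard error printed above; the residual degrees of freedom, 216 - 43 = 173, are an assumption based on the 216 observations shown in the density plot and the 43 coefficients listed. The names fit, manual, and area_ci are illustrative.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
est <- coef(summary(fit))[, "Estimate"]
se <- coef(summary(fit))[, "Std. Error"]
tcrit <- qt(0.975, df = df.residual(fit))   # two-sided 95% critical value

manual <- cbind(est - tcrit * se, est + tcrit * se)
print(all.equal(unname(manual), unname(confint(fit))))

# Same arithmetic for the area coefficient of Model (I), assuming 173 residual df
area_ci <- 0.0041496205 + c(-1, 1) * qt(0.975, df = 173) * 0.0003977202
print(area_ci)
```

Because the response is log(totalprice), each interval is on the log scale; an interval that excludes zero, such as the one for area, corresponds to a term that is significant at the 5% level in the coefficient table.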

(vii)


> RC RC
 [1] 63.43 19.42  3.51  2.88  1.33  1.04  0.68  0.41  0.35  0.12  6.83
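The code that produced RC is not shown in full. The eleven values sum to 100, which is consistent with sequential (Type I) sums of squares from anova() expressed as percentages of the total, with the last value being the residual share. The sketch below illustrates that computation on the built-in mtcars data under this assumption; the names fit, ss, and rc are illustrative.

```r
fit <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)

tab <- anova(fit)                      # sequential sums of squares, residuals last
ss <- tab[["Sum Sq"]]
rc <- round(100 * ss / sum(ss), 2)     # percentage of total variability per row
names(rc) <- rownames(tab)
print(rc)

# Share explained by all terms equals R^2
print(c(explained = 1 - ss[length(ss)] / sum(ss), R2 = summary(fit)$r.squared))

# Check with the values printed above: area plus zone
print(63.43 + 19.42)
```

Sequential sums of squares depend on the order in which terms enter the model, so the percentages describe each variable's contribution after the variables listed before it.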

The relative contributions of area, zone, elevator, toilets, garage, category, out, storage, heating, and streetcategory to explaining the variability of log(totalprice) in Model (I), given as percentages, are 63.43, 19.42, 3.51, 2.88, 1.33, 1.04, 0.68, 0.41, 0.35, and 0.12, respectively.

(viii) The variable that explains the most variability in Model (I) is area (63.43%).

(ix) Variables area and zone together explain 82.85% of the variability in Model (I).

(x) The backtransformed predictions for Model (I) without bias correction are listed in the fit column of the data frame OWBC, and the corresponding lower and upper confidence limits are listed in its lwr and upr columns. The bias-corrected backtransformed predictions for Model (I) (Ỹ_pred) are listed in the Ytilde.pred column of the data frame OWBC, and the bias-corrected lower and upper confidence limits are listed in its l.inf and l.sup columns.

> + > > > + + > + + > + > >

Yhat.pred