English Pages 682 Year 2015
Solutions Manual for Probability and Statistics with R, Second Edition by
María Dolores Ugarte, Ana F. Militino, and Alan T. Arnholt
Boca Raton London New York
CRC Press is an imprint of the Taylor & Francis Group, an informa business
CRC Press Taylor & Francis Group 6000 Broken Sound Parkway NW, Suite 300 Boca Raton, FL 33487-2742 © 2016 by Taylor & Francis Group, LLC CRC Press is an imprint of Taylor & Francis Group, an Informa business No claim to original U.S. Government works Printed on acid-free paper Version Date: 20150619 International Standard Book Number-13: 978-1-4665-0443-1 (Ancillary) This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint. Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers. For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged. Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. 
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com
Contents
1 What Is R?
2 Exploring Data
3 General Probability and Random Variables
4 Univariate Probability Distributions
5 Multivariate Probability Distributions
6 Sampling and Sampling Distributions
7 Point Estimation
8 Confidence Intervals
9 Hypothesis Testing
10 Nonparametric Methods
11 Experimental Design
12 Regression
Bibliography
Index
Chapter 1 What Is R?
1. Calculate the following numerical results to three decimal places with R:
(a) $(7 - 8) + 5^3 - 5 \div 6 + \sqrt{62}$
(b) $\ln 3 + \sqrt{2}\,\sin(\pi) - e^3$
(c) $2 \times (5 + 3) - \sqrt{6} + 9^2$
(d) $\ln(5) - \exp(2) + 2^3$
(e) $(9 \div 2) \times 4 - \sqrt{10} + \ln(6) - \exp(1)$
Solution:
(a) 131.041
> round((7 - 8) + 5^3 - 5/6 + sqrt(62), 3)
[1] 131.041
(b) -18.987
> round(log(3) - sqrt(2) * sin(pi) - exp(3), 3)
[1] -18.987
(c) 94.551
> round(2 * (5 + 3) - sqrt(6) + 9^2, 3)
[1] 94.551
(d) 2.22
> round(log(5) - exp(2) + 2^3, 3)
[1] 2.22
(e) 13.911
> round(9/2 * 4 - sqrt(10) + log(6) - exp(1), 3)
[1] 13.911
2. Create a vector named countby5 that is a sequence of 5 to 100 in steps of 5.
Solution:
> countby5 <- seq(from = 5, to = 100, by = 5)
> countby5
 [1]   5  10  15  20  25  30  35  40  45  50  55  60  65  70  75  80  85
[18]  90  95 100
3. Create a vector named Treatment with the entries “Treatment One” appearing 20 times, “Treatment Two” appearing 18 times, and “Treatment Three” appearing 22 times.
Solution:
> Treatment <- rep(c("Treatment One", "Treatment Two", "Treatment Three"),
+                  times = c(20, 18, 22))
> xtabs(~Treatment)
Treatment
  Treatment One Treatment Three   Treatment Two
             20              22              18
4. Provide the missing values in rep(seq( , , ), ) to create the sequence 20, 15, 15, 10, 10, 10, 5, 5, 5, 5.
Solution:
> rep(seq(from = 20, to = 5, by = -5), times = 1:4)
 [1] 20 15 15 10 10 10  5  5  5  5
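The role of the times argument in this solution can be checked directly. The sketch below uses only base R and contrasts a vector-valued times with the each argument; the object names base, by_times, and by_each are illustrative and are not part of the exercise.

```r
# Illustrative names (not from the exercise): base, by_times, by_each
base <- seq(from = 20, to = 5, by = -5)   # the four distinct values

# A vector-valued `times` repeats the i-th element times[i] times
by_times <- rep(base, times = 1:4)

# `each` repeats every element the same number of times
by_each <- rep(base, each = 2)

print(by_times)
print(by_each)
```

Because times has the same length as the sequence, each value gets its own repetition count, which is what produces the staircase pattern the exercise asks for.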
5. Vectors, sequences, and logical operators
(a) Assign the names x and y to the values 5 and 7, respectively. Find $x^y$ and assign the result to z. What is the value stored in z?
(b) Create the vectors u = (1, 2, 5, 4) and v = (2, 2, 1, 1) using the c() function.
(c) Provide R code to find which component of u is equal to 5.
(d) Provide R code to give the components of v greater than or equal to 2.
(e) Find the product u × v. How does R perform the operation?
(f) Explain what R does when two vectors of unequal length are multiplied together. Specifically, what is u × c(u, v)?
(g) Provide R code to define a sequence from 1 to 10 called G and subsequently to select the first three components of G.
(h) Use R to define a sequence from 1 to 30 named J with an increment of 2 and subsequently to choose the first, third, and eighth values of J.
(i) Calculate the scalar product (dot product) of q = (3, 0, 1, 6) by r = (1, 0, 2, 4).
(j) Define the matrix X whose rows are the u and v vectors from part (b).
(k) Define the matrix Y whose columns are the u and v vectors from part (b).
(l) Find the matrix product of X by Y and name it W.
(m) Provide R code that computes the inverse matrix of W and the transpose of that inverse.
Solution:
(a) The value stored in z is 78125.
> x <- 5
> y <- 7
> z <- x^y
> z
[1] 78125
(b)
> u <- c(1, 2, 5, 4)
> v <- c(2, 2, 1, 1)
(c)
> which(u == 5)
[1] 3
(d)
> which(v >= 2)
[1] 1 2
(e) Multiplication of vectors with R is element by element.
> uv <- u * v
> uv
[1] 2 4 5 4
(f) The values in the shorter vector are recycled until the two vectors are the same size. In this case, u*c(u, v) is the same as c(u, u)*c(u, v).
> u * (c(u, v))
[1]  1  4 25 16  2  4  5  4
> c(u, u) * c(u, v)
[1]  1  4 25 16  2  4  5  4
(g)
> G <- 1:10
> G[1:3]
[1] 1 2 3
(h)
> J <- seq(from = 1, to = 30, by = 2)
> J[c(1, 3, 8)]
[1]  1  5 15
(i)
> q <- c(3, 0, 1, 6)
> r <- c(1, 0, 2, 4)
> q %*% r
     [,1]
[1,]   29
(j)
> X <- rbind(u, v)
> X
  [,1] [,2] [,3] [,4]
u    1    2    5    4
v    2    2    1    1
(k)
> Y <- cbind(u, v)
> Y
     u v
[1,] 1 2
[2,] 2 2
[3,] 5 1
[4,] 4 1
(l)
> W <- X %*% Y
> W
   u  v
u 46 15
v 15 10
(m)
> solve(W)
            u           v
u  0.04255319 -0.06382979
v -0.06382979  0.19574468
> t(solve(W))
            u           v
u  0.04255319 -0.06382979
v -0.06382979  0.19574468
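The inverse reported in part (m) can be sanity-checked by multiplying it back against W. The following is a minimal sketch in base R; the names W_inv and check are illustrative and do not appear in the exercise.

```r
u <- c(1, 2, 5, 4)
v <- c(2, 2, 1, 1)
X <- rbind(u, v)        # 2 x 4: rows are u and v
Y <- cbind(u, v)        # 4 x 2: columns are u and v
W <- X %*% Y            # 2 x 2 matrix of dot products

# Illustrative names (not from the exercise): W_inv, check
W_inv <- solve(W)

# Multiplying W by its inverse should give the 2 x 2 identity, up to rounding
check <- W %*% W_inv
print(round(check, 10))

# W is symmetric, so its inverse should equal the transpose of the inverse
print(isTRUE(all.equal(W_inv, t(W_inv))))
```

The symmetry of W explains why solve(W) and t(solve(W)) print the same values in the solution.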
6. How many of the apartments in the VIT2005 data frame, part of the PASWR2 package, have a totalprice greater than €400,000 and also have a garage? Use a single line of R code to determine the answer.
Solution: There are 6 apartments with a totalprice greater than €400,000 that also have a garage.
> dim(VIT2005[VIT2005$totalprice >= 400000 & VIT2005$garage >= 1, ])[1]
[1] 6
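An equivalent way to count rows is to sum a logical vector, since each TRUE counts as 1. The sketch below uses a small invented data frame named flats (a hypothetical stand-in whose column names mirror VIT2005, not the real data) and also shows that > and >= can differ only when a price equals the threshold exactly.

```r
# Hypothetical toy data; column names mirror VIT2005 but the values are invented
flats <- data.frame(
  totalprice = c(250000, 410000, 399000, 520000, 400000),
  garage     = c(0, 1, 1, 0, 2)
)

# The logical vector is TRUE for each row meeting both conditions;
# sum() counts the TRUE values
n_strict <- sum(flats$totalprice > 400000 & flats$garage >= 1)

# With >= an apartment priced at exactly 400,000 is also counted
n_inclusive <- sum(flats$totalprice >= 400000 & flats$garage >= 1)

print(c(strict = n_strict, inclusive = n_inclusive))
```

On the real data the two comparisons give the same count whenever no apartment with a garage is priced at exactly €400,000.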
7. Wheat harvested surface in Spain in 2004: Figure 1.1, made with R, depicts the autonomous communities in Spain. The Wheat Table that follows gives the wheat harvested surfaces in 2004 by autonomous communities in Spain measured in hectares. Provide R code to answer all the questions.
FIGURE 1.1: Autonomous communities in Spain
Wheat Table
community             wheat.surface
Galicia                       18817
Asturias                         65
Cantabria                       440
País Vasco                    25143
Navarra                       66326
La Rioja                      34214
Aragón                       311479
Cataluña                      74206
Islas Baleares                 7203
Castilla y León              619858
Madrid                        13118
Castilla-La Mancha           263424
C. Valenciana                  6111
Región de Murcia               9500
Extremadura                  143250
Andalucía                    558292
Islas Canarias                  100
(a) Create the variables community and wheat.surface from the Wheat Table in this problem. Store both variables in a data.frame named wheatspain. (b) Find the maximum, the minimum, and the range for the variable wheat.surface. (c) Which community has the largest harvested wheat surface? (d) Sort the autonomous communities by harvested surface in ascending order. (e) Sort the autonomous communities by harvested surfaces in descending order. (f) Create a new file called wheat.c where Asturias has been removed. (g) Add Asturias back to the file wheat.c. (h) Create in wheat.c a new variable called acre indicating the harvested surface in acres (1 acre = 0.40468564224 hectares). (i) What is the total harvested surface in hectares and in acres in Spain in 2004? (j) Define in wheat.c the row.names() using the names of the communities. Remove the community variable from wheat.c. (k) What percent of the autonomous communities have a harvested wheat surface greater than the mean wheat surface area? (l) Sort wheat.c by autonomous communities’ names (row.names()). (m) Determine the communities with less than 40,000 acres of harvested surface and find their total harvested surface in hectares and acres. (n) Create a new file called wheat.sum where the autonomous communities that have less than 40,000 acres of harvested surface are consolidated into a single category named “less than 40,000” with the results from (m). (o) Use the function dump() on wheat.c, storing the results in a new file named wheat.txt. Remove wheat.c from your path and check that you can recover it from wheat.txt. (p) Create a text file called wheat.dat from the wheat.sum file using the command write.table(). Explain the differences between wheat.txt and wheat.dat. (q) Use the command read.table() to read the file wheat.dat.
Solution:
(a)
> community <- c("Galicia", "Asturias", "Cantabria", "Pais Vasco", "Navarra",
+                "La Rioja", "Aragon", "Cataluna", "Islas Baleares",
+                "Castilla y Leon", "Madrid", "Castilla-La Mancha",
+                "C. Valenciana", "Region de Murcia", "Extremadura",
+                "Andalucia", "Islas Canarias")
> wheat.surface <- c(18817, 65, 440, 25143, 66326, 34214, 311479, 74206, 7203,
+                    619858, 13118, 263424, 6111, 9500, 143250, 558292, 100)
> wheat.spain <- data.frame(community, wheat.surface)
(b)
> max(wheat.spain$wheat.surface)
[1] 619858
> min(wheat.spain$wheat.surface)
[1] 65
> diff(range(wheat.spain$wheat.surface))
[1] 619793
(c)
> wheat.spain[wheat.spain$wheat.surface == max(wheat.spain$wheat.surface),]
         community wheat.surface
10 Castilla y Leon        619858
(d)
> IO <- wheat.spain[order(wheat.spain$wheat.surface), ]
> head(IO)
          community wheat.surface
2          Asturias            65
17   Islas Canarias           100
3         Cantabria           440
13    C. Valenciana          6111
9    Islas Baleares          7203
14 Region de Murcia          9500
(e)
> DO <- wheat.spain[order(wheat.spain$wheat.surface, decreasing = TRUE), ]
> head(DO)
            community wheat.surface
10    Castilla y Leon        619858
16          Andalucia        558292
7              Aragon        311479
12 Castilla-La Mancha        263424
15        Extremadura        143250
8            Cataluna         74206
(f)
> wheat.c <- wheat.spain[-2, ]
> head(wheat.c)
   community wheat.surface
1    Galicia         18817
3  Cantabria           440
4 Pais Vasco         25143
5    Navarra         66326
6   La Rioja         34214
7     Aragon        311479
(g)
> RM <- wheat.spain[2, ]
> wheat.c <- rbind(wheat.c, RM)
> wheat.c
            community wheat.surface
1             Galicia         18817
3           Cantabria           440
4          Pais Vasco         25143
5             Navarra         66326
6            La Rioja         34214
7              Aragon        311479
8            Cataluna         74206
9      Islas Baleares          7203
10    Castilla y Leon        619858
11             Madrid         13118
12 Castilla-La Mancha        263424
13      C. Valenciana          6111
14   Region de Murcia          9500
15        Extremadura        143250
16          Andalucia        558292
17     Islas Canarias           100
2            Asturias            65
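(h)
> wheat.c$acre <- wheat.c$wheat.surface / 0.40468564224
(i)
> sum(wheat.c$wheat.surface)
[1] 2151546
> sum(wheat.c$acre)
[1] 5316586
(j)
> nc <- as.character(wheat.c$community)
> row.names(wheat.c) <- nc
> wheat.c$community <- NULL
(k) 29.4118% of the autonomous communities have a harvested wheat surface greater than the mean wheat surface area.
> mean(wheat.c$wheat.surface > mean(wheat.c$wheat.surface)) * 100
[1] 29.41176
(l)
> AO <- wheat.c[order(row.names(wheat.c)), ]
> head(AO)
                wheat.surface         acre
Andalucia              558292 1379569.5763
Aragon                 311479  769681.3711
Asturias                   65     160.6185
C. Valenciana            6111   15100.6099
Cantabria                 440    1087.2637
Castilla y Leon        619858 1531702.4755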
(m) The total harvested area is 36537 hectares or 90284.8932 acres.
> lessthan40k <- wheat.c[wheat.c$acre < 40000, ]
> lessthan40k
                 wheat.surface       acre
Cantabria                  440  1087.2637
Islas Baleares            7203 17799.0006
Madrid                   13118 32415.2839
C. Valenciana             6111 15100.6099
Region de Murcia          9500 23475.0112
Islas Canarias             100   247.1054
Asturias                    65   160.6185
> apply(lessthan40k, 2, sum)
wheat.surface          acre
     36537.00      90284.89
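(n)
> lt40 <- apply(lessthan40k, 2, sum)
> wheat.sum <- wheat.c[wheat.c$acre >= 40000, ]
> wheat.sum <- rbind(wheat.sum, lt40)
> row.names(wheat.sum)[nrow(wheat.sum)] <- "less than 40,000"
(o)
> dump("wheat.c", "wheat.txt")
> rm(wheat.c)
> wheat.c # no longer available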
Error in eval(expr, envir, enclos): object 'wheat.c' not found
> source("wheat.txt")
> head(wheat.c)
           wheat.surface       acre
Galicia            18817  46497.820
Cantabria            440   1087.264
Pais Vasco         25143  62129.706
Navarra            66326 163895.115
La Rioja           34214  84544.635
Aragon            311479 769681.371
(p) There are different values stored in each of wheat.txt and wheat.dat. Specifically, the values from part (m) are collapsed into one category "less than 40,000" in wheat.dat, whereas wheat.txt has all of the values.
> write.table(x = wheat.sum, file = "wheat.dat")
(q)
> tail(read.table(file = "wheat.dat"))
                   wheat.surface       acre
Cataluna                   74206  183367.02
Castilla y Leon           619858 1531702.48
Castilla-La Mancha        263424  650934.88
Extremadura               143250  353978.46
Andalucia                 558292 1379569.58
less than 40,000           36537   90284.89
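Beyond the stored values, the two files in parts (o) and (p) also differ in format: dump() writes R source code that rebuilds an object under its own name, while write.table() writes only the values as delimited text. The sketch below illustrates this with a small invented data frame named toy and temporary files; these names are assumptions for illustration and are not the exercise's wheat.c or wheat.sum.

```r
# Hypothetical toy data frame standing in for wheat.c
toy <- data.frame(wheat.surface = c(440, 7203),
                  row.names = c("Cantabria", "Islas Baleares"))

txt_file <- tempfile(fileext = ".txt")
dat_file <- tempfile(fileext = ".dat")

# dump() writes R code that re-creates the object, including its name
dump("toy", file = txt_file)
# write.table() writes the values as a plain delimited text table
write.table(toy, file = dat_file)

original <- toy
rm(toy)
source(txt_file)                     # re-creates an object named `toy`
from_table <- read.table(dat_file)   # returns a data frame that must be assigned

print(readLines(txt_file))
print(readLines(dat_file))
print(all.equal(original, toy))
```

Reading the two files with readLines() makes the contrast visible: one contains an assignment to be evaluated by source(), the other a header line and rows to be parsed by read.table().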
8. Access the data from url http://www.stat.berkeley.edu/users/statlabs/data/babies.data and store the information in an object named BABIES using the function read.table(). A description of the variables can be found at http://www.stat.berkeley.edu/users/statlabs/labs.html. These data are a subset from a much larger study dealing with child health and development. (a) The variables bwt, gestation, parity, age, height, weight, and smoke use values of 999, 999, 9, 99, 99, 999, and 9, respectively, to denote “unknown.” R uses NA to denote a missing or unavailable value. Recode the missing values in BABIES. Hint: use something similar to BABIES$bwt[BABIES$bwt == 999] = NA. (b) Use the function na.omit() to create a “clean” data set that removes subjects if any observations on the subject are “unknown.” Store the modified data frame in a data frame named CLEAN. (c) How many missing values are there for gestation, age, height, weight, and smoke, respectively? How many rows of BABIES have no missing values, one missing value, two missing values, and three missing values, respectively? Note: the number of rows in CLEAN should agree with your answer for the number of rows in BABIES that have no missing values.
(d) Use the function complete.cases() to create a “clean” data set that removes subjects if any observations on the subject are “unknown.” Store the modified data frame in a data frame named CLEAN2. Write a line of code that shows all of the values in CLEAN are the same as those in CLEAN2.
(e) Sort the values in CLEAN by bwt, gestation, and age. Store the sorted values in a data frame named BGA and show the last six rows.
(f) Store the data frame CLEAN in your working directory as a *.csv file.
(g) What percent of the women in CLEAN are pregnant with their first child (parity = 0) and do not smoke?
Solution:
(a)
> site <- "http://www.stat.berkeley.edu/users/statlabs/data/babies.data"
> BABIES <- read.table(file = url(site), header = TRUE)
> summary(BABIES)
      bwt          gestation         parity            age
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00
 Mean   :119.6   Mean   :286.9   Mean   :0.2549   Mean   :27.37
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00
 Max.   :176.0   Max.   :999.0   Max.   :1.0000   Max.   :99.00
     height          weight        smoke
 Min.   :53.00   Min.   : 87   Min.   :0.0000
 1st Qu.:62.00   1st Qu.:115   1st Qu.:0.0000
 Median :64.00   Median :126   Median :0.0000
 Mean   :64.67   Mean   :154   Mean   :0.4644
 3rd Qu.:66.00   3rd Qu.:140   3rd Qu.:1.0000
 Max.   :99.00   Max.   :999   Max.   :9.0000
> dim(BABIES)
[1] 1236    7
> BABIES$bwt[BABIES$bwt == 999] = NA
> BABIES$gestation[BABIES$gestation == 999] = NA
> BABIES$parity[BABIES$parity == 9] = NA
> BABIES$age[BABIES$age == 99] = NA
> BABIES$height[BABIES$height == 99] = NA
> BABIES$weight[BABIES$weight == 999] = NA
> BABIES$smoke[BABIES$smoke == 999] = NA
> summary(BABIES)
      bwt          gestation         parity            age
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00
                 NA's   :13                       NA's   :2
     height          weight          smoke
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000
 Median :64.00   Median :125.0   Median :0.0000
 Mean   :64.05   Mean   :128.6   Mean   :0.4644
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000
 Max.   :72.00   Max.   :250.0   Max.   :9.0000
 NA's   :22      NA's   :36
> dim(BABIES)
[1] 1236    7
(b)
> CLEAN <- na.omit(BABIES)
> dim(CLEAN)
[1] 1184    7
(c) There are 13 missing values for gestation, 2 missing values for age, 22 missing values for height, 36 missing values for weight, and none for smoke. There are 1184 rows of BABIES with no missing values, 33 rows of BABIES with one missing value, 17 rows of BABIES with two missing values, and 2 rows of BABIES with three missing values.
> summary(BABIES)
      bwt          gestation         parity            age
 Min.   : 55.0   Min.   :148.0   Min.   :0.0000   Min.   :15.00
 1st Qu.:108.8   1st Qu.:272.0   1st Qu.:0.0000   1st Qu.:23.00
 Median :120.0   Median :280.0   Median :0.0000   Median :26.00
 Mean   :119.6   Mean   :279.3   Mean   :0.2549   Mean   :27.26
 3rd Qu.:131.0   3rd Qu.:288.0   3rd Qu.:1.0000   3rd Qu.:31.00
 Max.   :176.0   Max.   :353.0   Max.   :1.0000   Max.   :45.00
                 NA's   :13                       NA's   :2
     height          weight          smoke
 Min.   :53.00   Min.   : 87.0   Min.   :0.0000
 1st Qu.:62.00   1st Qu.:114.8   1st Qu.:0.0000
 Median :64.00   Median :125.0   Median :0.0000
 Mean   :64.05   Mean   :128.6   Mean   :0.4644
 3rd Qu.:66.00   3rd Qu.:139.0   3rd Qu.:1.0000
 Max.   :72.00   Max.   :250.0   Max.   :9.0000
 NA's   :22      NA's   :36
> table(apply(is.na(BABIES), 1, sum))
   0    1    2    3
1184   33   17    2
(d)
> CLEAN <- na.omit(BABIES)
> CLEAN2 <- BABIES[complete.cases(BABIES), ]
> sum(CLEAN != CLEAN2)
[1] 0
(e)
> BGAo <- order(CLEAN$bwt, CLEAN$gestation, CLEAN$age)
> BGA <- CLEAN[BGAo, ]
> tail(BGA)
     bwt gestation parity age height weight smoke
595  170       303      1  21     64    129     0
240  173       293      0  30     63    110     0
557  174       281      0  37     67    155     0
1100 174       284      0  39     65    163     0
748  174       288      0  25     61    182     0
633  176       293      1  19     68    180     0
(f)
> write.csv(CLEAN, file = "CLEAN.csv")
(g) 44.3412% of the women in CLEAN are pregnant with their first child and do not smoke.
> xtabs(~parity + smoke, data = CLEAN)
      smoke
parity   0   1   9
     0 525 341  10
     1 190 118   0
> prop.table(xtabs(~parity + smoke, data = CLEAN))[1,1]*100
[1] 44.34122
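The recoding in part (a) repeats one line per variable, which makes it easy to mistype a code. A lookup vector of "unknown" codes, applied in a loop, is one way to keep each variable paired with its own code. The sketch below is an illustration on invented data: the data frame toy and the vector codes are hypothetical names, and only three of the seven variables are shown, using the codes listed in the exercise statement (999, 999, and 9).

```r
# Hypothetical toy data; column names follow BABIES but the values are invented
toy <- data.frame(
  bwt       = c(120, 999, 113, 128),
  gestation = c(284, 282, 999, 279),
  smoke     = c(0, 1, 9, 0)
)

# One "unknown" code per variable, as listed in the exercise statement
codes <- c(bwt = 999, gestation = 999, smoke = 9)

# Replace each variable's code with NA
for (v in names(codes)) {
  toy[[v]][toy[[v]] == codes[[v]]] <- NA
}

# Missing values per column and per row
print(colSums(is.na(toy)))
print(table(rowSums(is.na(toy))))

# na.omit() and complete.cases() keep the same rows
clean1 <- na.omit(toy)
clean2 <- toy[complete.cases(toy), ]
print(all(clean1 == clean2))
```

The per-row tabulation with rowSums(is.na()) gives the same information as table(apply(is.na(BABIES), 1, sum)) in part (c).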
9. The data frame WHEATUSA2004 from the PASWR2 package has the USA wheat harvested crop surfaces in 2004 by states. It has two variables, states for the state and acres for thousands of acres. (a) Use the function row.names() to define the states as the row names for the data frame WHEATUSA2004 . (b) Define a new variable called ha for the surface area given in hectares where 1 acre = 0.40468564224 hectares. (c) Sort the file according to the harvested surface area in acres. (d) Which states fall in the top 10% of states for harvested surface area?
(e) Save the contents of WHEATUSA2004 in a new file called WHEATUSA.txt in your favorite directory. Then, remove WHEATUSA2004 from your workspace, and check that the contents of WHEATUSA2004 can be recovered from WHEATUSA.txt.
(f) Use the command write.table() to store the contents of WHEATUSA2004 in a file with the name WHEATUSA.dat. Explain the differences between storing WHEATUSA2004 using dump() and using write.table().
(g) Find the total harvested surface area in acres for the bottom 10% of the states.
Solution:
(a)
> STATES <- WHEATUSA2004$states
> row.names(WHEATUSA2004) <- STATES
> head(WHEATUSA2004)
   states acres
AR     AR   620
CA     CA   320
CO     CO  1700
DE     DE    47
GA     GA   190
ID     ID   700
(b)
> WHEATUSA2004$ha <- WHEATUSA2004$acres * 0.40468564224
> head(WHEATUSA2004)
   states acres        ha
AR     AR   620 250.90510
CA     CA   320 129.49941
CO     CO  1700 687.96559
DE     DE    47  19.02023
GA     GA   190  76.89027
ID     ID   700 283.27995
(c)
> io <- WHEATUSA2004[order(WHEATUSA2004$acres), ]
> head(io)
   states acres       ha
DE     DE    47 19.02023
NY     NY   100 40.46856
MS     MS   135 54.63256
PA     PA   135 54.63256
MD     MD   145 58.67942
SC     SC   180 72.84342
(d) Kansas, Oklahoma, and Texas are in the top 10% of states for harvested surface area.
> top10 <- quantile(WHEATUSA2004$acres, probs = 0.9)
> ans <- WHEATUSA2004[WHEATUSA2004$acres > top10, ]
> row.names(ans)
[1] "KS" "OK" "TX"
(e)
> dump("WHEATUSA2004", "WHEATUSA.txt")
> rm(WHEATUSA2004)
> source("WHEATUSA.txt")
> head(WHEATUSA2004)
   states acres        ha
AR     AR   620 250.90510
CA     CA   320 129.49941
CO     CO  1700 687.96559
DE     DE    47  19.02023
GA     GA   190  76.89027
ID     ID   700 283.27995
(f) dump() writes a text representation of the object as R code, so the object, with its name and attributes, is re-created with source(). write.table() writes only the values as a delimited text table, which is read back with read.table() and must be assigned to a name.
> write.table(WHEATUSA2004, "WHEATUSA.dat")
(g) The total harvested area for the bottom 10% of states is 147 thousand acres.
> bottom10 <- quantile(WHEATUSA2004$acres, probs = 0.1)
> ans <- WHEATUSA2004[WHEATUSA2004$acres < bottom10, ]
> ans
   states acres       ha
DE     DE    47 19.02023
NY     NY   100 40.46856
> THA <- sum(ans$acres)
> THA # Total Harvested Acres
[1] 147
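The selections in parts (d) and (g) compare each value with a sample quantile using strict inequalities, so a state whose value equals the cutoff is not selected. The sketch below reproduces that pattern on an invented data frame named toy with made-up row labels; none of these names or values come from WHEATUSA2004.

```r
# Hypothetical toy data (values invented), thousands of acres by state label
toy <- data.frame(acres = c(47, 100, 135, 320, 620, 700, 1700, 3500, 4700, 8500),
                  row.names = c("S1", "S2", "S3", "S4", "S5",
                                "S6", "S7", "S8", "S9", "S10"))

# 10th and 90th percentiles with the default quantile algorithm
cutoffs <- quantile(toy$acres, probs = c(0.10, 0.90))

# Strict inequalities: values equal to a cutoff are not selected
bottom <- toy[toy$acres < cutoffs["10%"], , drop = FALSE]
top    <- toy[toy$acres > cutoffs["90%"], , drop = FALSE]

print(cutoffs)
print(row.names(bottom))
print(row.names(top))
print(sum(bottom$acres))
```

The drop = FALSE argument keeps the result a data frame even though toy has a single column, so the row names remain available.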
10. Use the data frame VIT2005 in the PASWR2 package, which contains data on the 218 used apartments sold in Vitoria (Spain) in 2005 to answer the following questions. A description of the variables can be obtained from the help file for this data frame. (a) Create a table of the number of apartments according to the number of garages. (b) Find the mean of totalprice according to the number of garages. (c) Create a frequency table of apartments using the categories: number of garages and number of elevators. (d) Find the mean flat price (total price) for each of the cells of the table created in part (c).
(e) What command will select only the apartments having at least one garage?
(f) Define a new file called data.c with the apartments that have category = "3B" and have an elevator.
(g) Find the mean of totalprice and the mean of area using the information in data.c.
Solution:
(a)
> xtabs(~garage, data = VIT2005)
garage
  0   1   2
167  49   2
(b) The average price for an apartment with no garage, one garage, and two garages is €260537.4385, €345987.7551, and €369250, respectively.
> tapply(VIT2005$totalprice, list(VIT2005$garage), mean)
       0        1        2
260537.4 345987.8 369250.0
(c)
> xtabs(~garage + elevator, data = VIT2005)
      elevator
garage   0   1
     0  44 123
     1   0  49
     2   0   2
(d)
> tapply(VIT2005$totalprice, list(VIT2005$garage, VIT2005$elevator),
+ mean)
         0        1
0 210492.1 278439.8
1       NA 345987.8
2       NA 369250.0
(e)
> atleastonegarage <- VIT2005[VIT2005$garage >= 1, ]
> dim(atleastonegarage)
[1] 51 15
(f)
> data.c <- VIT2005[VIT2005$category == "3B" & VIT2005$elevator == 1, ]
> dim(data.c)
[1] 62 15
(g) The mean of totalprice and the mean of area using the data frame data.c are €287564.5161 and 88.9524 m², respectively.
> MeanTotalPrice <- mean(data.c$totalprice)
> MeanArea <- mean(data.c$area)
> c(MeanTotalPrice, MeanArea)
[1] 287564.51613     88.95242
11. Use the data frame EPIDURALF to answer the following questions:
(a) How many patients have been treated with the Hamstring Stretch?
(b) What percent of the patients treated with Hamstring Stretch were classified as each of Easy, Difficult, and Impossible?
(c) What percent of the patients classified as Easy to palpate were assigned to the Traditional Sitting position?
(d) What is the mean weight for each cell in a contingency table created with the variables Ease and Treatment?
(e) What percent of the patients have a body mass index (BMI = kg/(cm/100)^2) less than 25 and are classified as Easy to palpate?
Solution:
(a) A total of 171 patients have been treated with the hamstring stretch position.
> xtabs(~treatment, data = EPIDURALF)
treatment
  Hamstring Stretch Traditional Sitting
                171                 171
> xtabs(~treatment, data = EPIDURALF)[1]
Hamstring Stretch
              171
(b) The percent of patients treated with hamstring stretch that were classified as Easy, Difficult, and Impossible was 58.4795%, 36.8421%, and 4.6784%, respectively.
> T1 <- xtabs(~treatment + ease, data = EPIDURALF)
> T1
                     ease
treatment             Difficult Easy Impossible
  Hamstring Stretch          63  100          8
  Traditional Sitting        51  107         13
> prop.table(T1[1, ]) * 100
 Difficult       Easy Impossible
 36.842105  58.479532   4.678363
(c) 51.6908% of the patients classified as easy to palpate were assigned to the traditional sitting position.
> T1
                     ease
treatment             Difficult Easy Impossible
  Hamstring Stretch          63  100          8
  Traditional Sitting        51  107         13
> prop.table(T1[, "Easy"])[2] * 100
Traditional Sitting
           51.69082
(d)
> tapply(EPIDURALF$kg, list(EPIDURALF$ease, EPIDURALF$treatment),
+ mean)
           Hamstring Stretch Traditional Sitting
Difficult           92.66667            94.27451
Easy                78.67000            79.40187
Impossible         127.87500           113.61538
(e) 9.0643% of patients have a body mass index less than 25 and are classified as easy to palpate.
> EPIDURALF$BMI <- EPIDURALF$kg/(EPIDURALF$cm/100)^2
> EPIDURALF[1:5, 3:8]
   cm       ease           treatment oc complications      BMI
1 172  Difficult Traditional Sitting  0          None 39.21038
2 176       Easy   Hamstring Stretch  0          None 27.76343
3 157  Difficult Traditional Sitting  0          None 29.21011
4 169       Easy   Hamstring Stretch  2          None 22.05805
5 163 Impossible Traditional Sitting  0          None 42.90715
> mean(EPIDURALF$ease == "Easy" & EPIDURALF$BMI < 25) * 100
[1] 9.064327
Nationality MetricToEnglish
picker + + + + + + > >
pickerM 6) stop("m must be less than 6") n >
IRF + + + + + + + + + > >
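$$A \cdot \frac{i}{n} = R\left[\left(1 + \frac{i}{n}\right)^{n \cdot t} - 1\right] \quad\Longrightarrow\quad R = \frac{A \cdot \frac{i}{n}}{\left(1 + \frac{i}{n}\right)^{n \cdot t} - 1}$$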
ARF
Chapter 2 Exploring Data
> library(MASS)
> help(package = "MASS")
(b) The description file says lqs() fits a regression to the points in the data set, thereby achieving a regression estimator with a high breakdown point.
(c) The function search() provides a list of attached packages and the function library() shows all installed packages.
2. Load Cars93 from the MASS package.
(a) Create density histograms for the variables Min.Price, Max.Price, Weight, and Length using a different color for each histogram.
(b) Superimpose estimated density curves over the histograms.
(c) Use the bwplot() function from lattice to create a box and whiskers plot of Price for every type of vehicle according to the drive train. Do you observe any differences between prices?
(d) Create a graph similar to the one created in (c) using functions from ggplot2.
Solution: (a) and (b) use the following code:
> p1 p2 p3 p4
> multiplot(p1, p2, p3, p4, layout = matrix(c(1, 2, 3, 4),
+ byrow = TRUE, nrow = 2))
[Figure: density histograms with superimposed density curves for Min.Price, Max.Price, Weight, and Length]
(c) Vehicles with rear wheel drive trains tend to be more expensive than the same type of vehicles with front wheel drive trains.
> bwplot(Price ~ DriveTrain | Type, data = Cars93, as.table = TRUE)
[Figure: box and whiskers plots of Price by drive train (4WD, Front, Rear), one panel per vehicle type (Compact, Large, Midsize, Small, Sporty, Van)]
(d)
> ggplot(data = Cars93, aes(x = DriveTrain, y = Price)) +
+ geom_boxplot() +
+ facet_wrap(~ Type) +
+ theme_bw() # black and white
[Figure: boxplots of Price by DriveTrain (4WD, Front, Rear), one panel per vehicle type (Compact, Large, Midsize, Small, Sporty, Van)]
3. Load the data frame WHEATSPAIN from the PASWR2 package.
(a) Find the quantiles, deciles, mean, maximum, minimum, interquartile range, variance, and standard deviation of the variable hectares. Comment on the results. What was Spain’s 2004 total harvested wheat area in hectares?
(b) Create a function that calculates the quantiles, the mean, the variance, the standard deviation, the total, and the range of any variable.
(c) Which communities are below the 10th percentile in hectares? Which communities are above the 90th percentile? In which percentile is Navarra?
(d) Create and display in the same graphics device a frequency histogram of the variable acres and a density histogram of the variable acres. Superimpose a density curve over the second histogram.
(e) Explain why using breaks of 0; 100,000; 250,000; 360,000; and 1,550,000 automatically results in a density histogram when using hist() from base graphics.
(f) Create and display in the same graphics device a barplot of acres and a density histogram of acres using break points of 0; 100,000; 250,000; 360,000; and 1,550,000.
(g) Add vertical lines to the density histogram of acres to indicate the locations of the mean and the median, respectively.
(h) Create a boxplot of hectares and label the communities that appear as outliers in the boxplot. (Hint: Use identify().)
(i) Determine the community with the largest harvested wheat surface area using either acres or hectares. Remove this community from the data frame and compute the mean, median, and standard deviation of hectares. How do these values compare to the values for these statistics computed in (a)?
Solution:
(a) The distribution of harvested wheat is unimodal and skewed to the right. This skew is seen in how much larger the mean (126561.5294) is than the median (25143). The difference between Q1 and Q2 is also much smaller than the difference between Q3 and Q2. The total harvested area is 2151546 hectares.
> quantile(WHEATSPAIN$hectares)
    0%    25%    50%    75%   100%
    65   7203  25143 143250 619858
> quantile(WHEATSPAIN$hectares, probs = seq(from = 0.1, to = 1.0, by = 0.1))
     10%      20%      30%      40%      50%      60%      70%      80%
   304.0   6329.4   9040.6  15397.6  25143.0  53481.2  88014.8 239389.2
     90%     100%
410204.2 619858.0
> mean(WHEATSPAIN$hectares)
[1] 126561.5
> IQR(WHEATSPAIN$hectares)
[1] 136047
> var(WHEATSPAIN$hectares)
[1] 38934822657
> sd(WHEATSPAIN$hectares)
[1] 197319.1
> sum(WHEATSPAIN$hectares)
[1] 2151546
(b)
> describe
(c) Asturias and Canarias are below the 10th percentile (304 hectares). Andalucia and Castilla-Leon are above the 90th percentile (410204.2 hectares). Since Navarra is the eleventh out of seventeen communities, it corresponds to the 10/16 × 100 = 62.5th percentile.
> WHEATSPAIN[order(WHEATSPAIN$hectares), ]
            community hectares     acres
2            Asturias       65     160.6
17           Canarias      100     247.1
3           Cantabria      440    1087.3
13       C.Valenciana     6111   15100.6
9            Baleares     7203   17799.0
14             Murcia     9500   23475.0
11             Madrid    13118   32415.3
1             Galicia    18817   46497.8
4             P.Vasco    25143   62129.7
6            La Rioja    34214   84544.6
5             Navarra    66326  163895.1
8            Cataluna    74206  183367.0
15        Extremadura   143250  353978.5
12 Castilla-La Mancha   263424  650934.9
7              Aragon   311479  769681.4
16          Andalucia   558292 1379569.6
10      Castilla-Leon   619858 1531702.5
> which(WHEATSPAIN[order(WHEATSPAIN$hectares), ]$community=="Navarra")
[1] 11
> pk <- (11 - 1)/(17 - 1)
> pk
[1] 0.625
> quantile(WHEATSPAIN$hectares, probs = pk)
62.5%
66326
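Part (b) asks for a function that bundles several summaries. The following is a minimal sketch of what such a function could look like; the name describe appears in the solution, but the body, the na.rm argument, and the list layout here are assumptions, not the original code. The hectares values are copied from the sorted WHEATSPAIN listing in part (c).

```r
# A hypothetical implementation; the body is an assumption, not the original code
describe <- function(x, na.rm = TRUE) {
  list(
    quantiles = quantile(x, na.rm = na.rm),
    mean      = mean(x, na.rm = na.rm),
    variance  = var(x, na.rm = na.rm),
    sd        = sd(x, na.rm = na.rm),
    total     = sum(x, na.rm = na.rm),
    range     = diff(range(x, na.rm = na.rm))
  )
}

# Hectares copied from the sorted WHEATSPAIN listing in part (c)
hectares <- c(65, 100, 440, 6111, 7203, 9500, 13118, 18817, 25143, 34214,
              66326, 74206, 143250, 263424, 311479, 558292, 619858)
res <- describe(hectares)
print(res)
```

Returning a named list keeps each statistic accessible by name (for example res$total), which is convenient when the function is applied to several variables.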
(d)
> p1 p2 multiplot(p1, p2)
[Figure: frequency histogram (count) of acres and density histogram of acres with a superimposed density curve]
(e) If the breaks used in hist() are not equidistant, the default is to produce a density histogram.
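The reason for this default is that bars of unequal width drawn with heights equal to counts would overstate the wide bins, so hist() sets freq to TRUE only when the breaks are equidistant. The sketch below checks this on the acres values copied from the sorted WHEATSPAIN listing in part (c); the object names acres, bins, and h are illustrative.

```r
# Acres copied from the sorted WHEATSPAIN listing in part (c)
acres <- c(160.6, 247.1, 1087.3, 15100.6, 17799.0, 23475.0, 32415.3, 46497.8,
           62129.7, 84544.6, 163895.1, 183367.0, 353978.5, 650934.9,
           769681.4, 1379569.6, 1531702.5)
bins <- c(0, 100000, 250000, 360000, 1550000)

# plot = FALSE returns the histogram object without drawing it
h <- hist(acres, breaks = bins, plot = FALSE)

print(h$equidist)                        # whether the bins have equal widths
print(h$counts)                          # observations per bin
print(sum(h$density * diff(h$breaks)))   # total area of the density bars
```

In a density histogram the area of each bar, not its height, is proportional to the number of observations, so the areas sum to one.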
(f)
> bins <- c(0, 100000, 250000, 360000, 1550000)
(i)
> noCL <- WHEATSPAIN[WHEATSPAIN$community != "Castilla-Leon", ]
> mean(WHEATSPAIN$hectares)
[1] 126561.5
> mean(noCL$hectares)
[1] 95730.5
> median(WHEATSPAIN$hectares)
[1] 25143
> median(noCL$hectares)
[1] 21980
> sd(WHEATSPAIN$hectares)
[1] 197319.1
> sd(noCL$hectares)
[1] 155864.7
4. Load the WHEATUSA2004 data frame from the PASWR2 package. (a) Find the quantiles, deciles, mean, maximum, minimum, interquartile range, variance, and standard deviation for the variable acres. Comment on what the most appropriate measures of center and spread would be for this variable. What is the USA’s 2004 total harvested wheat surface area? (b) Which states are below the 20th percentile? Which states are above the 80th percentile? In which quantile is WI (Wisconsin)? (c) Create a frequency and a density histogram in the same graphics device using square plotting regions of the values in ACRES. (d) Add vertical lines to the density histogram from (c) to indicate the location of the mean and the median. (e) Create a boxplot of the acres and locate the outliers’ communities and their values.
(f) Determine the state with the largest harvested wheat surface in acres. Remove this state from the data frame and compute the mean, median, and standard deviation of acres. How do these values compare to the values for these statistics computed in (a)?
Solution:
(a) The distribution of harvested wheat is unimodal and skewed to the right. This skew is seen in how much larger the mean (1148.7333) is than the median (630). The difference between Q1 and Q2 is also smaller than the difference between Q3 and Q2. The total harvested area is 34462 thousand acres.
> quantile(WHEATUSA2004$acres)
     0%     25%     50%     75%    100%
  47.00  198.75  630.00 1213.75 8500.00
> quantile(WHEATUSA2004$acres, probs = seq(from = 0.1, to = 1.0, by = 0.1))
   10%    20%    30%    40%    50%    60%    70%    80%    90%   100%
 135.0  180.0  263.5  416.0  630.0  824.0  982.5 1634.0 1925.0 8500.0
> mean(WHEATUSA2004$acres)
[1] 1148.733
> IQR(WHEATUSA2004$acres)
[1] 1015
> var(WHEATUSA2004$acres)
[1] 2980303
> sd(WHEATUSA2004$acres)
[1] 1726.355
> sum(WHEATUSA2004$acres)
[1] 34462
(b) DE, NY, MS, PA, and MD are below 180 thousand acres. NE, CO, WA, TX, OK, and KS are above 1634 thousand acres. Since WI is the ninth out of thirty states, it corresponds to the 8/29 × 100 = 27.5862th percentile.
> bottom20 <- quantile(WHEATUSA2004$acres, probs = 0.20)
> bottom20
20%
180
> WHEATUSA2004[WHEATUSA2004$acres < bottom20, ]   # bottom states
   states acres       ha
DE     DE    47 19.02023
MD     MD   145 58.67942
MS     MS   135 54.63256
NY     NY   100 40.46856
PA     PA   135 54.63256
> top20 <- quantile(WHEATUSA2004$acres, probs = 0.80)
> top20
 80%
1634
> WHEATUSA2004[WHEATUSA2004$acres > top20, ]   # top states
   states acres        ha
CO     CO  1700  687.9656
KS     KS  8500 3439.8280
NE     NE  1650  667.7313
OK     OK  4700 1902.0225
TX     TX  3500 1416.3997
WA     WA  1750  708.1999
> WHEATUSA2004[order(WHEATUSA2004$acres), ]
      states acres         ha
DE        DE    47   19.02023
NY        NY   100   40.46856
MS        MS   135   54.63256
PA        PA   135   54.63256
MD        MD   145   58.67942
SC        SC   180   72.84342
VA        VA   180   72.84342
GA        GA   190   76.89027
WI        WI   225   91.05427
TN        TN   280  113.31198
CA        CA   320  129.49941
KY        KY   380  153.78054
IN        IN   440  178.06168
NC        NC   460  186.15540
AR        AR   620  250.90510
MI        MI   640  258.99881
ID        ID   700  283.27995
OR        OR   780  315.65480
OH        OH   890  360.17022
IL        IL   900  364.21708
MO        MO   930  376.35765
Other  Other  1105  447.17763
SD        SD  1250  505.85705
MT        MT  1630  659.63760
NE        NE  1650  667.73131
CO        CO  1700  687.96559
WA        WA  1750  708.19987
TX        TX  3500 1416.39975
OK        OK  4700 1902.02252
KS        KS  8500 3439.82796
> which(WHEATUSA2004[order(WHEATUSA2004$acres), ]$states == "WI")
[1] 9
> pk <- 8/29
> pk
[1] 0.2758621
> quantile(WHEATUSA2004$acres, probs = pk)
27.58621%
      225
(c)
> p1 p2 multiplot(p1, p2)
[Figure: frequency histogram (count versus acres) and density histogram (density versus acres) in the same graphics device.]
(d)
> p2 + geom_vline(xintercept = c(median(WHEATUSA2004$acres),
+                                mean(WHEATUSA2004$acres))) +
+   annotate("text", label = "Median",
+            x = median(WHEATUSA2004$acres), y = 0.0012) +
+   annotate("text", label = "Mean",
+            x = mean(WHEATUSA2004$acres), y = 0.0010)
[Figure: density histogram of acres with vertical lines labeled Median and Mean.]
(e) The three outliers correspond to KS, OK, and TX.
> boxplot(WHEATUSA2004$acres)
> OUTA OUTA   # outlier values
[1] 8500 4700 3500
> WHEATUSA2004[WHEATUSA2004$acres %in% OUTA, ]
   states acres       ha
KS     KS  8500 3439.828
OK     OK  4700 1902.023
TX     TX  3500 1416.400
[Figure: boxplot of acres.]
(f) Based on the output from part (b), KS, in the ninth indexed position of the WHEATUSA2004 data frame, has the largest harvested wheat surface. Once KS is removed, the mean, median, and standard deviation are all smaller than those computed in part (a).
> noKS <- WHEATUSA2004[-9, ]
> mean(WHEATUSA2004$acres)
[1] 1148.733
> mean(noKS$acres)
[1] 895.2414
> median(WHEATUSA2004$acres)
[1] 630
> median(noKS$acres)
[1] 620
> sd(WHEATUSA2004$acres)
[1] 1726.355
> sd(noKS$acres)
[1] 1044.102
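The summaries in parts (a), (b), and (f) can be cross-checked without the PASWR2 package, using only the thirty acres values printed in part (b). The following is a minimal sketch rather than the authors' code: the names `acres`, `cutoffs`, and `no_max` are assumptions introduced here, and dropping the largest value with which.max() stands in for removing KS.

```r
# Harvested acres (in thousands) copied from the sorted listing in part (b);
# the vector name `acres` is hypothetical
acres <- c(47, 100, 135, 135, 145, 180, 180, 190, 225, 280,
           320, 380, 440, 460, 620, 640, 700, 780, 890, 900,
           930, 1105, 1250, 1630, 1650, 1700, 1750, 3500, 4700, 8500)

# 20th and 80th percentiles using R's default (type 7) quantile definition
cutoffs <- quantile(acres, probs = c(0.20, 0.80))

# WI is the 9th of 30 ordered values, so its percentile rank is (9 - 1)/(30 - 1)
pk <- (9 - 1) / (length(acres) - 1)

# Drop the largest observation (KS) and recompute center and spread
no_max <- acres[-which.max(acres)]
c(mean = mean(acres), median = median(acres), sd = sd(acres))
c(mean = mean(no_max), median = median(no_max), sd = sd(no_max))
```

Removing a single extreme value moves the mean and standard deviation far more than the median, which is why the median and IQR are the more appropriate summaries for this right-skewed variable.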
5. The data frame VIT2005 in the PASWR2 package contains descriptive information and the appraised total price (in euros) for apartments in Vitoria, Spain.
(a) Create a frequency table, a piechart, and a barplot showing the number of apartments grouped by the variable out. For you, which method conveys the information best?
(b) Characterize the distribution of the variable totalprice.
(c) Characterize the relationship between totalprice and area.
(d) Create a Trellis plot of totalprice versus area conditioning on toilets. Create the same graph with ggplot2 graphics. Are there any outliers? Ignoring any outliers, between what two values of area do apartments have both one and two bathrooms?
(e) Use the area values reported in (d) to create a subset of apartments that have both one and two bathrooms. By how much does an additional bathroom increase the appraised value of an apartment? Would you be willing to pay for an additional bathroom if you lived in Vitoria, Spain?
Solution:
(a) The barplot is the easiest to read.
> VIT2005$out levels(VIT2005$out)
[1] "E25"  "E50"  "E75"  "E100"
> xtabs(~out, data = VIT2005)
out
 E25  E50  E75 E100
   3   87    6  122
> p1 max(VIT2005$totalprice)
[1] 560000
> median(VIT2005$totalprice)
[1] 269750
> IQR(VIT2005$totalprice)
[1] 100125
[Figure: histogram of totalprice (count versus totalprice).]
(c) There is a positive linear relationship between totalprice and area.
> ggplot(data = VIT2005, aes(x = area, y = totalprice)) +
+   geom_point() +
+   theme_bw()
[Figure: scatterplot of totalprice versus area.]
(d) Apartments with one bathroom are generally between 50 and 100 m², while apartments with two bathrooms are generally between 80 and 160 m². Apartments with one bathroom and apartments with two bathrooms both occur for areas of roughly 80 to 100 m².
> xyplot(totalprice ~ area | toilets, data = VIT2005, layout = c(1, 2),
+        as.table = TRUE)
> TEXT ggplot(data = VIT2005, aes(x = area, y = totalprice,
+                            color = as.factor(toilets))) +
+   geom_point() +
+   facet_grid(toilets ~ .) +
+   theme_bw() +
+   guides(color = guide_legend(TEXT))
[Figure: totalprice versus area conditioned on toilets (panels 1 and 2), drawn with lattice and with ggplot2; the ggplot2 legend is titled "Number of Toilets".]
(e) The median increase in totalprice for a second bathroom for apartments between 80 and 100 m² is €36000. Whether readers would be willing to spend €36000 for an additional bathroom will vary.
> bothbaths = 80 & area ANS ANS
     1      2
255000 291000
> diff(ANS)
    2
36000
6. Consider the data frame PAMTEMP from the PASWR2 package, which contains temperature and precipitation for Pamplona, Spain, from January 1, 1990, to December 31, 2010.
(a) Create side-by-side violin plots of the variable tmean for each month. Make sure the level of month is correct. Hint: Look at the examples for PAMTEMP. Characterize the pattern of side-by-side violin plots.
(b) Create side-by-side plots of the variable tmean for each year. Characterize the pattern of side-by-side violin plots.
(c) Find the date for the minimum value of tmean.
(d) Find the date for the maximum value of tmean.
(e) Find the date for the maximum value of precip.
(f) How many days have reported a tmax value greater than 38 °C?
(g) Create a barplot showing the total precipitation by month for the period January 1, 1990, to December 31, 2010. Based on your barplot, which month had the least amount of precipitation? Which month had the greatest amount of precipitation? Hint: Use the plyr package to create an appropriate data frame.
(h) Create a barplot showing the total precipitation by year for the period January 1, 1990, to December 31, 2010. Based on your barplot, which year had the least amount of precipitation? Which year had the greatest amount of precipitation? Hint: Use the plyr package to create an appropriate data frame.
(i) Create a graph showing the maximum temperature versus year and the minimum temperature versus year. Does the graph suggest temperatures are becoming more extreme over time?
Solution:
(a)
> library(PASWR2)
> levels(PAMTEMP$month)
 [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct"
[12] "Sep"
> PAMTEMP$month levels(PAMTEMP$month)
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov"
[12] "Dec"
> ggplot(data = PAMTEMP) +
+   geom_violin(aes(x = month, y = tmean, fill = month)) +
+   theme_bw() +
+   guides(fill = FALSE) +
+   labs(x = "", y = "Temperature (Celsius)")
[Figure: violin plots of tmean, Temperature (Celsius), by month, Jan through Dec.]
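The levels of month change from alphabetical to calendar order between the two levels() calls in part (a), but the assignment that does this did not survive in the transcript. One way to obtain that ordering is sketched below on a small stand-in vector; the names `m` and `m_ordered` are hypothetical, and this is an assumption about the approach, not the authors' confirmed code.

```r
# Month abbreviations stored as a factor get alphabetical levels by default
m <- factor(c("Jan", "Feb", "Mar", "Apr", "Dec"))
levels(m)

# Re-create the factor with calendar-ordered levels;
# month.abb is a constant built into base R
m_ordered <- factor(m, levels = month.abb)
levels(m_ordered)
```

Only the level order changes; the stored values stay the same, so plots that use the factor on the x-axis run from Jan to Dec.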
The center of each violin plot from January to July generally increases. From August to December, the center of each violin plot decreases. There is a cyclical pattern of warming and then cooling throughout the year.
(b)
> ggplot(data = PAMTEMP) +
+   geom_violin(aes(x = as.factor(year), y = tmean,
+                   fill = as.factor(year))) +
+   theme_bw() +
+   guides(fill = FALSE) +
+   labs(x = "", y = "Temperature (Celsius)") +
+   theme(axis.text.x = element_text(angle = 60, hjust = 1))
[Figure: violin plots of tmean, Temperature (Celsius), by year, 1990 through 2010.]
There is no apparent pattern in the side-by-side violin plots of tmean. Temperature variations over the period 1990 to 2010 for Pamplona, Spain, appear similar.
(c)
> PAMTEMP[which.min(PAMTEMP$tmean), ]
     tmax tmin precip day month year tmean
4285    2  -10    0.5  25   Dec 2001    -4
The minimum value of tmean is -4 °C, which occurred on Dec 25, 2001.
(d)
> PAMTEMP[which.max(PAMTEMP$tmean), ]
     tmax tmin precip day month year tmean
4873   39   23      0   5   Aug 2003    31
The maximum value of tmean is 31 °C, which occurred on Aug 5, 2003.
(e)
> PAMTEMP[which.max(PAMTEMP$precip), ]
     tmax tmin precip day month year tmean
1455  8.6    4   69.2  25   Dec 1993   6.3
The maximum value of precip is 69.2 mm, which occurred on Dec 25, 1993.
(f)
> sum(PAMTEMP$tmax > 38)
[1] 15
Fifteen days reported a tmax value greater than 38 °C.
(g)
> library(plyr)
> SEL head(SEL)
  year month       TP
1 1990   Jan  31.1008
2 1990   Feb  26.8005
3 1990   Mar   9.3001
4 1990   Apr 121.1001
5 1990   May 120.5002
6 1990   Jun  77.0006
> ggplot(data = SEL, aes(x = month, y = TP, fill = month)) +
+   geom_bar(stat = "identity") +
+   labs(y = "Total Precipitation (1990-2010) in mm", x = "") +
+   theme_bw() +
+   guides(fill = FALSE)
[Figure: barplot of total precipitation (1990-2010) in mm by month, Jan through Dec.]
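The plyr call that built SEL is not recoverable from the transcript. As a stand-in, the same kind of monthly total can be computed with base R's aggregate(), which is a different technique from the plyr approach the hint suggests. The sketch below uses a hypothetical toy data frame `toy` whose values are made up for illustration and are not taken from PAMTEMP.

```r
# Hypothetical toy data (NOT taken from PAMTEMP): daily precipitation in mm
toy <- data.frame(
  year   = c(1990, 1990, 1990, 1991, 1991, 1991),
  month  = factor(c("Jan", "Jan", "Feb", "Jan", "Feb", "Feb"),
                  levels = c("Jan", "Feb")),
  precip = c(1.5, 0.0, 4.0, 2.5, 1.0, 3.0)
)

# Total precipitation per month, summed across all years
by_month <- aggregate(precip ~ month, data = toy, FUN = sum)

# Months with the smallest and the largest totals
by_month[which.min(by_month$precip), ]
by_month[which.max(by_month$precip), ]
```

Replacing `month` with `year` on the right-hand side of the formula gives yearly totals of the kind requested in part (h).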
August has the minimum total precipitation of all of the months for the period 1990-2010. November has the maximum total precipitation of all of the months for the period 1990-2010.
(h)
> SELY head(SELY)
  year       TP
1 1990 692.5048
2 1991 704.0052
3 1992 902.8038
4 1993 752.1041
5 1994 638.8045
6 1995 582.3028
> ggplot(data = SELY, aes(x = year, y = TP, fill = as.factor(year))) +
+   geom_bar(stat = "identity") +
+   labs(x = "", y = "Total Precipitation (mm)") +
+   theme_bw() +
+   guides(fill = FALSE)
> SELY[which.max(SELY$TP), ]
  year       TP
8 1997 929.2025
> SELY[which.min(SELY$TP), ]
  year       TP
9 1998 566.2011
[Figure: barplot of total precipitation (mm) by year, 1990 through 2010.]
The greatest yearly total precipitation on record (929.2025 mm) occurred in 1997. The least yearly total precipitation on record (566.2011 mm) occurred in 1998.
(i)
> SEL head(SEL)
  year Tmax Tmin
1 1990 37.0 -5.6
2 1991 38.2 -5.2
3 1992 36.4 -4.4
4 1993 37.6 -4.5
5 1994 37.2 -6.8
6 1995 40.0 -5.0
> ggplot(data = SEL, aes(x = year, y = Tmax)) +
+   geom_line(color = "red") +
+   geom_line(aes(x = year, y = Tmin), color = "blue") +
+   theme_bw() +
+   labs(y = "Temperature (Celsius)") +
+   geom_smooth(method = "lm", color = "red") +
+   geom_smooth(aes(x = year, y = Tmin), method = "lm")
[Figure: yearly maximum (red) and minimum (blue) temperature in Celsius versus year, 1990 through 2010, each with a fitted least squares line.]
Based on the graph, there is too much variability from year to year to make any statement about the weather becoming more extreme over time.
7. Access the data from url http://www.stat.berkeley.edu/users/statlabs/data/babies.data and store the information in an object named BABIES using the function read.table(). A description of the variables can be found at http://www.stat.berkeley.edu/users/statlabs/labs.html. These data are a subset from a much larger study dealing with child health and development.
(a) Create a "clean" data set that removes subjects if any observations on the subject are "unknown." Note that bwt, gestation, parity, age, height, weight, and smoke use values of 999, 999, 9, 99, 99, 999, and 9, respectively, to denote "unknown." Store the modified data set in an object named CLEAN.
(b) Use the information in CLEAN to create a density histogram of the birth weights of babies whose mothers have never smoked (smoke=0) and another histogram placed directly below the first in the same graphics device for the birth weights of babies whose mothers currently smoke (smoke=1). Make the range of the x-axis 30 to 180 (ounces) for both histograms. Superimpose a density curve over each histogram.
(c) Based on the histograms in (b), characterize the distribution of baby birth weight for both non-smoking and smoking mothers.
(d) What is the mean weight difference between babies of smokers and non-smokers? Can you think of any reasons not to use the mean as a measure of center to compare birth weights in this problem?
(e) Create side-by-side boxplots to compare the birth weights of babies whose mothers never smoked and those who currently smoke. Use traditional graphics (boxplot()), lattice graphics (bwplot()), and ggplot graphics to create the boxplots.
(f) What is the median weight difference between babies who are firstborn and those who are not?
(g) Create a single graph of the densities for pre-pregnancy weight for mothers who have never smoked and for mothers who currently smoke. Make sure both densities appear on the same graphics device and use an appropriate legend.
(h) Characterize the pre-pregnancy distribution of weight for mothers who have never smoked and for mothers who currently smoke.
(i) What is the mean pre-pregnancy weight difference between mothers who do not smoke and those who do? Can you think of any reasons not to use the mean as a measure of center to compare pre-pregnancy weights in this problem?
(j) Compute the body mass index (BMI) for each mother in CLEAN. Recall that BMI is defined as kg/m² (0.0254 m = 1 in., and 0.45359 kg = 1 lb.). Add the variables weight in kg, height in m, and BMI to CLEAN and store the result in CLEANP.
(k) Characterize the distribution of BMI.
(l) Group pregnant mothers according to their BMI quartile. Find the mean and standard deviation for baby birth weights in each quartile for mothers who have never smoked and those who currently smoke. Find the median and IQR for baby birth weights in each quartile for mothers who have never smoked and those who currently smoke. Based on your answers, would you characterize birth weight in each group as relatively symmetric or skewed? Create histograms and densities of bwt conditioned on BMI quartiles and whether the mother smokes to verify your previous assertions about the shape.
(m) Create side-by-side boxplots of bwt based on whether the mother smokes conditioned on BMI quartiles. Does this graph verify your findings in (l)?
(n) Does it appear that BMI is related to the birth weight of a baby? Create a scatterplot of birth weight (bwt) versus BMI while conditioning on BMI quartiles and whether the mother smokes to help answer the question.
(o) Replace baby birth weight (bwt) with gestation length (gestation) and answer questions (l), (m), and (n).
(p) Create a scatterplot of bwt versus gestation conditioned on BMI quartiles and whether the mother smokes. Fit straight lines to the data using lm(), lqs(), and rlm(); and display the lines in the scatterplots. What do you find interesting about the resulting graphs?
(q) Create a table of smoke by parity. Display the numerical results in a graph. What percent of mothers did not smoke during the pregnancy of their first child?
Solution:
> site BABIES head(BABIES)
  bwt gestation parity age height weight smoke
1 120       284      0  27     62    100     0
2 113       282      0  33     64    135     0
3 128       279      0  28     64    115     1
4 123       999      0  36     69    190     0
5 108       282      0  23     67    125     1
6 136       286      0  25     62     93     0
(a)
> CLEAN CLEAN$smoke ggplot(data = CLEAN, aes(x = bwt, y = ..density..)) +
+   geom_histogram(fill = "lightpink") +
+   geom_density(color = "red") +
+   facet_grid(smoke ~ .) +
+   xlim(30, 180) +
+   theme_bw()
[Figure: density histograms of bwt with superimposed density curves; panels Non-Smoker and Smoker.]
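The assignment that creates CLEAN in part (a) is missing from the transcript. A minimal sketch of the cleaning rule, applied to the six rows printed by head(BABIES), is given below; the names `babies6`, `unknown`, `is_unknown`, and `clean6` are hypothetical, and this is one possible approach rather than the authors' code.

```r
# First six rows of BABIES as printed by head(BABIES)
babies6 <- data.frame(
  bwt       = c(120, 113, 128, 123, 108, 136),
  gestation = c(284, 282, 279, 999, 282, 286),
  parity    = c(0, 0, 0, 0, 0, 0),
  age       = c(27, 33, 28, 36, 23, 25),
  height    = c(62, 64, 64, 69, 67, 62),
  weight    = c(100, 135, 115, 190, 125, 93),
  smoke     = c(0, 0, 1, 0, 1, 0)
)

# "Unknown" codes listed in part (a), one per column
unknown <- c(bwt = 999, gestation = 999, parity = 9, age = 99,
             height = 99, weight = 999, smoke = 9)

# Logical matrix: TRUE where a column equals its own "unknown" code
is_unknown <- sapply(names(unknown), function(v) babies6[[v]] == unknown[[v]])

# Keep only the rows with no "unknown" entries
clean6 <- babies6[rowSums(is_unknown) == 0, ]
```

Here the fourth subject is dropped because gestation is coded 999; comparing each column only against its own code avoids discarding legitimate values such as an age of 9 or a weight of 99 in some other column.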
(c) Based on the density histograms in part (b), the distributions of birth weights for both smoking and non-smoking mothers are unimodal and symmetric. The mean and standard deviation for birth weights of babies of non-smoking mothers are 123.0853 and 17.4237 ounces, respectively. The mean and standard deviation for birth weights of babies of smoking mothers are 113.8192 and 18.295 ounces, respectively.
> mean(CLEAN$bwt[CLEAN$smoke == "Non-Smoker"])
[1] 123.0853
> sd(CLEAN$bwt[CLEAN$smoke == "Non-Smoker"])
[1] 17.4237
> mean(CLEAN$bwt[CLEAN$smoke == "Smoker"])
[1] 113.8192
> sd(CLEAN$bwt[CLEAN$smoke == "Smoker"])
[1] 18.29501
(d) The mean birth weight difference between babies of non-smoking mothers and babies of smoking mothers is 9.2661 ounces.
> ANS ANS
Non-Smoker     Smoker
  123.0853   113.8192
> DIFF DIFF
Non-Smoker
  9.266143
(e)
> boxplot(bwt ~ smoke, data = CLEAN)
> bwplot(bwt ~ smoke, data = CLEAN)
> ggplot(data = CLEAN, aes(x = smoke, y = bwt)) +
+   geom_boxplot() +
+   theme_bw()
[Figure: side-by-side boxplots of bwt for Non-Smoker and Smoker, drawn with traditional, lattice, and ggplot2 graphics.]
(f) The median birth weight difference between firstborn babies and those that are not firstborn is 2 ounces.
> ANS ANS
  0   1
120 118
> DIFF DIFF
0
2
(g)
> ggplot(data = CLEAN, aes(x = weight, color = smoke)) +
+   geom_density() +
+   theme_bw()
[Figure: density curves of weight for Non-Smoker and Smoker, with a legend titled smoke.]
(h) The distribution of pre-pregnancy weight for both smokers and non-smokers is unimodal and skewed to the right. The median and IQR of pre-pregnancy weight for smokers are 125 and 24.5 pounds, respectively. The median and IQR of pre-pregnancy weight for non-smokers are 126 and 25 pounds, respectively.
> median(CLEAN$weight[CLEAN$smoke == "Smoker"])
[1] 125
> IQR(CLEAN$weight[CLEAN$smoke == "Smoker"])
[1] 24.5
> median(CLEAN$weight[CLEAN$smoke == "Non-Smoker"])
[1] 126
> IQR(CLEAN$weight[CLEAN$smoke == "Non-Smoker"])
[1] 25
(i) The mean pre-pregnancy weight difference between non-smokers and smokers is 2.5603 pounds. The mean should not be used as a measure of center in this problem since both distributions are skewed.
> ANS ANS
Non-Smoker     Smoker
  129.4797   126.9194
> DIFF DIFF
Non-Smoker
   2.56033
(j)
> CLEANP ggplot(data = CLEANP, aes(x = BMI)) + geom_density() + theme_bw()
[Figure: density curve of BMI.]
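The code that builds CLEANP in part (j) is not legible in the transcript. The BMI definition from the exercise, and a quartile grouping of the kind used in part (l), can be sketched on the six mothers printed by head(BABIES). The vector names below are hypothetical, and cut() with quantile() breaks is an assumption about how the Quartiles factor might be formed, chosen because it produces interval labels in the same style as the output shown in part (l).

```r
# Height (in.) and weight (lb.) for the six mothers printed by head(BABIES)
height_in <- c(62, 64, 64, 69, 67, 62)
weight_lb <- c(100, 135, 115, 190, 125, 93)

# Unit conversions given in the exercise
height_m  <- height_in * 0.0254
weight_kg <- weight_lb * 0.45359

# Body mass index in kg/m^2
BMI <- weight_kg / height_m^2

# Quartile groups; include.lowest = TRUE keeps the minimum in the first bin
Quartiles <- cut(BMI, breaks = quantile(BMI), include.lowest = TRUE)
table(Quartiles)
```

Without include.lowest = TRUE the mother with the smallest BMI would fall outside the first interval and be coded NA.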
(l) The requested answers are computed in the following R Code 2.1. Based on the values, birth weight in each quartile appears to be symmetric regardless of the mother's smoking status.
R Code 2.1
values
> tapply(CLEANP$bwt, list(CLEANP$Quartiles, CLEANP$smoke),
+        IQR)
            Non-Smoker Smoker
[15.7,19.9]       20.0  25.25
(19.9,21.3]       21.0  21.75
(21.3,23.3]       19.5  24.00
(23.3,40.4]       23.0  22.00
> ggplot(data = CLEANP, aes(x = bwt)) + geom_histogram(fill = "lightblue") +
+   theme_bw() + facet_grid(smoke ~ Quartiles)
> ggplot(data = CLEANP, aes(x = bwt)) + geom_density() +
+   theme_bw() + facet_grid(smoke ~ Quartiles)
[Figure: histograms of bwt by BMI quartile (columns) and smoking status (rows).]
[Figure: density curves of bwt by BMI quartile (columns) and smoking status (rows).]
(m) The boxplots also suggest the distribution of bwt is symmetric for both smokers and non-smokers in each quartile.
> ggplot(data = CLEANP, aes(x = smoke, y = bwt)) +
+   geom_boxplot() +
+   facet_grid(Quartiles ~ .) +
+   theme_bw()
[Figure: boxplots of bwt by smoking status, one panel per BMI quartile.]
(n) There appears to be no association between birth weight and BMI.
> ggplot(data = CLEANP, aes(x = BMI, y = bwt)) +
+   geom_point() +
+   facet_grid(smoke ~ Quartiles) +
+   theme_bw()
[Figure: scatterplots of bwt versus BMI by BMI quartile (columns) and smoking status (rows).]
(o) The mean, standard deviation, median, and IQR for gestation grouped according to BMI quartile and smoking status are computed in R Code 2.2. Based on the values, gestation in each quartile appears to be symmetric regardless of the mother's smoking status.
R Code 2.2
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        mean)
            Non-Smoker   Smoker
[15.7,19.9]   282.8938 277.2132
(19.9,21.3]   279.0331 277.4649
(21.3,23.3]   277.4372 279.6636
(23.3,40.4]   280.4764 277.4412
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        sd)
            Non-Smoker   Smoker
[15.7,19.9]   14.57214 14.55330
(19.9,21.3]   14.57810 14.40082
(21.3,23.3]   20.08376 15.33890
(23.3,40.4]   15.48780 16.77727
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        median)
            Non-Smoker Smoker
[15.7,19.9]        283    279
(19.9,21.3]        281    280
(21.3,23.3]        279    279
(23.3,40.4]        281    277
> tapply(CLEANP$gestation, list(CLEANP$Quartiles, CLEANP$smoke),
+        IQR)
            Non-Smoker Smoker
[15.7,19.9]      14.25   16.0
(19.9,21.3]      16.00   14.5
(21.3,23.3]      15.00   17.0
(23.3,40.4]      16.50   14.0
The histograms and density plots confirm a symmetric distribution of gestation regardless of BMI quartile or mother's smoking status.
> ggplot(data = CLEANP, aes(x = gestation)) +
+   geom_histogram(fill = "lightblue") +
+   theme_bw() +
+   facet_grid(smoke ~ Quartiles)
> ggplot(data = CLEANP, aes(x = gestation)) +
+   geom_density() +
+   theme_bw() +
+   facet_grid(smoke ~ Quartiles)
[Figure: histograms of gestation by BMI quartile (columns) and smoking status (rows).]
[Figure: density curves by BMI quartile (columns) and smoking status (rows).]
The boxplots also suggest the distribution of gestation is symmetric for both smokers and non-smokers in each quartile.
> ggplot(data = CLEANP, aes(x = smoke, y = gestation)) +
+   geom_boxplot() +
+   facet_grid(Quartiles ~ .) +
+   theme_bw()
[Figure: boxplots of gestation by smoking status, one panel per BMI quartile.]
There appears to be no association between gestation and BMI.
> ggplot(data = CLEANP, aes(x = BMI, y = gestation)) +
+   geom_point() +
+   facet_grid(smoke ~ Quartiles) +
+   theme_bw()
[Figure: scatterplots of gestation versus BMI by BMI quartile (columns) and smoking status (rows).]
(p) There seems to be less variability among the three model fits for smoking mothers than for non-smoking mothers.
> library(MASS)
lqsmod
> prop.table(T1, 2)
        pclass
survived       1st       2nd       3rd
     No  0.3808050 0.5703971 0.7447109
     Yes 0.6191950 0.4296029 0.2552891
(b) Of all passengers, 8.0978% were women in third class who survived, while 4.66% were men in first class who survived.
> T2 T2
, , survived = No

      sex
pclass female male
   1st      5  118
   2nd     12  146
   3rd    110  418

, , survived = Yes

      sex
pclass female male
   1st    139   61
   2nd     94   25
   3rd    106   75

> prop.table(T2)
, , survived = No

      sex
pclass      female        male
   1st 0.003819710 0.090145149
   2nd 0.009167303 0.111535523
   3rd 0.084033613 0.319327731

, , survived = Yes

      sex
pclass      female        male
   1st 0.106187930 0.046600458
   2nd 0.071810542 0.019098549
   3rd 0.080977846 0.057295646

(c) The distribution of age is bimodal and skewed to the right. The median is 28, and the IQR is 18.
> median(TITANIC3$age, na.rm = TRUE)
[1] 28
> IQR(TITANIC3$age, na.rm = TRUE)
[1] 18
> ggplot(data = TITANIC3, aes(x = age)) +
+   geom_density(fill = "pink") +
+   theme_bw()
[Figure: density curve of age.]
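The percentages quoted in part (b) come from prop.table(T2) with no margin, so each cell is divided by the total number of passengers. A survival rate within a class-and-sex group needs a conditional proportion instead. The sketch below rebuilds the printed T2 counts as an array and computes both; the object names `joint` and `cond` are hypothetical, and the conditional table is an addition for comparison, not part of the original solution.

```r
# Counts copied from the printed table T2 (pclass x sex x survived)
T2 <- array(
  c(5, 12, 110, 118, 146, 418,    # survived = No:  female 1st-3rd, male 1st-3rd
    139, 94, 106, 61, 25, 75),    # survived = Yes: female 1st-3rd, male 1st-3rd
  dim = c(3, 2, 2),
  dimnames = list(pclass = c("1st", "2nd", "3rd"),
                  sex = c("female", "male"),
                  survived = c("No", "Yes"))
)

# Joint proportions: each cell divided by the grand total
joint <- prop.table(T2)

# Conditional proportions: each cell divided by its pclass-by-sex total
cond <- prop.table(T2, margin = c(1, 2))

joint["3rd", "female", "Yes"]   # share of all passengers
cond["3rd", "female", "Yes"]    # survival rate among third-class women
cond["1st", "male", "Yes"]      # survival rate among first-class men
```

With margin = c(1, 2), the No and Yes entries for each class-and-sex combination sum to one, which is the form needed to compare survival rates between groups of different sizes.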
(d) Without considering pclass, the mean and median age for surviving females was higher than the mean and median age for males who survived. When pclass is taken into account, the mean age for surviving females is greater than the mean age for surviving males except in third class, while the median age for surviving females is greater than the median age for surviving males only in second class.
> with(data = TITANIC3, tapply(age, list(survived, sex), mean,
+      na.rm = TRUE))
      female     male
No  25.25521 31.51641
Yes 29.81535 26.97778
> with(data = TITANIC3, tapply(age, list(survived, sex), sd,
+      na.rm = TRUE))
      female     male
No  13.47688 13.79635
Yes 14.76928 15.55388
> with(data = TITANIC3, tapply(age, list(survived, sex), median,
+      na.rm = TRUE))
    female male
No    24.5   29
Yes   28.5   27
> with(data = TITANIC3, tapply(age, list(survived, sex), IQR,
+      na.rm = TRUE))
    female male
No   13.25 18.0
Yes  19.00 16.5
> with(data = TITANIC3, tapply(age, list(pclass, survived,
+      sex), mean, na.rm = TRUE))
, , female

          No      Yes
1st 35.20000 37.10938
2nd 34.09091 26.71105
3rd 23.41875 20.81482

, , male

          No      Yes
1st 43.65816 36.16824
2nd 33.09259 17.44927
3rd 26.67960 22.43644

> with(data = TITANIC3, tapply(age, list(pclass, survived,
+      sex), sd, na.rm = TRUE))
, , female

          No      Yes
1st 23.44568 13.93813
2nd 14.05315 12.62080
3rd 12.04303 12.32179

, , male

          No      Yes
1st 13.66284 15.09160
2nd 12.13161 16.70854
3rd 11.75896 10.70842

> with(data = TITANIC3, tapply(age, list(pclass, survived,
+      sex), median, na.rm = TRUE))
, , female

      No  Yes
1st 36.0 35.5
2nd 29.0 27.5
3rd 22.5 22.0

, , male

    No Yes
1st 45  36
2nd 30  19
3rd 25  25

> with(data = TITANIC3, tapply(age, list(pclass, survived,
+      sex), IQR, na.rm = TRUE))
, , female

        No   Yes
1st 25.000 24.00
2nd 16.000 14.25
3rd 13.125 12.00

, , male

        No  Yes
1st 22.125 21.0
2nd 16.000 27.5
3rd 13.000 10.5

(e) Both the mean and median age for males who survived were lower than the mean and median age for males who did not survive, with the exception that the median age for surviving and non-surviving males was the same in third class.
(f) The youngest female in first class to survive was 14 years old.
> with(data = TITANIC3,
+      sort(age[sex == "female" & survived == "Yes" & pclass == "1st"]))[1]
[1] 14
(g) Answers will vary.
9. Use the CARS2004 data frame from the PASWR2 package, which contains the numbers of cars per 1000 inhabitants (cars), the total number of known mortal accidents (deaths), and the country population/1000 (population) for the 25 member countries of the European Union for the year 2004.
(a) Compute the total number of cars per 1000 inhabitants in each country, and store the result in an object named total.cars. Determine the total number of known automobile fatalities in 2004 divided by the total number of cars for each country and store the result in an object named death.rate.
(b) Create a barplot showing the automobile death rate for each of the European Union member countries. Make the bars increase in magnitude so that the countries with the smallest automobile death rates appear first.
(c) Which country has the lowest automobile death rate? Which country has the highest automobile death rate?
(d) Create a scatterplot of population versus total.cars. How would you characterize the relationship?
(e) Find the least squares estimates for regressing population on total.cars. Superimpose the least squares line on the scatterplot from (d). What population does the least squares model predict for a country with a total.cars value of 19224.630? Find the difference between the population predicted from the least squares model and the actual population for the country with a total.cars value of 19224.630.
(f) Create a scatterplot of total.cars versus death.rate. How would you characterize the relationship between the two variables?
(g) Compute Spearman's rank correlation coefficient of total.cars and death.rate. (Hint: Use cor(x, y, method="spearman").) What is this coefficient measuring?
(h) Plot the logarithm of total.cars versus the logarithm of death.rate. How would you characterize the relationship?
(i) What are the least squares estimates for the regression of log(total.cars) on log(death.rate)? Superimpose the least squares line on the scatterplot from (h). What total number of cars does the least squares model predict for a country with a log(death.rate) value of -3.769252? Make sure you express your answer in the same units as those used for total.cars.
Solution:
(a)
> CARS2004 head(CARS2004)
         country cars deaths population death.rate total.cars
1        Belgium  467    112      10396 0.02306932   4854.932
2 Czech Republic  373    135      10212 0.03544167   3809.076
3        Denmark  354     68       5398 0.03558548   1910.892
4        Germany  546     71      82532 0.00157559  45062.472
5        Estonia  350    126       1351 0.26646928    472.850
6         Greece  348    147      11041 0.03825865   3842.268
(b)
> ggplot(data = CARS2004, aes(x = reorder(country, death.rate),
+                             y = death.rate)) +
+   geom_bar(stat = "identity", fill = "red") +
+   coord_flip() +
+   labs(x = "", y = "Death Rate",
+        title = "European 2004 Vehicular Death Rate") +
+   theme_bw()
[Figure: horizontal barplot titled "European 2004 Vehicular Death Rate" (x-axis: Death Rate). Countries from highest to lowest death rate: Cyprus, Luxembourg, Latvia, Estonia, Lithuania, Malta, Slovenia, Slovakia, Ireland, Hungary, Greece, Denmark, Czech Republic, Finland, Austria, Belgium, Portugal, Sweden, Poland, Netherlands, Spain, France, Italy, United Kingdom, Germany.]
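The assignment that adds total.cars and death.rate in part (a) is missing from the transcript, but the printed values are consistent with total.cars = cars × population / 1000 and death.rate = deaths / total.cars. The sketch below checks this on the six rows shown by head(CARS2004); the data frame name `cars6` is hypothetical, and the formulas are inferred from the printed output rather than taken from the authors' code.

```r
# Six rows copied from head(CARS2004); `cars6` is a hypothetical name
cars6 <- data.frame(
  country    = c("Belgium", "Czech Republic", "Denmark",
                 "Germany", "Estonia", "Greece"),
  cars       = c(467, 373, 354, 546, 350, 348),            # cars per 1000 inhabitants
  deaths     = c(112, 135, 68, 71, 126, 147),
  population = c(10396, 10212, 5398, 82532, 1351, 11041)   # population / 1000
)

# population is in thousands and cars is per 1000 inhabitants, so
# cars * population is the number of cars; dividing by 1000 gives thousands
cars6$total.cars <- cars6$cars * cars6$population / 1000
cars6$death.rate <- cars6$deaths / cars6$total.cars

# Order countries from smallest to largest death rate, as in the barplot
cars6[order(cars6$death.rate), c("country", "total.cars", "death.rate")]
```

Under these formulas total.cars is measured in thousands of cars, which matters when interpreting the predictions in parts (e) and (i).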
(c) The country with the lowest automobile death rate is Germany, while Cyprus has the highest automobile death rate.
(d) There is a positive curvilinear relationship between total.cars and population.
> ggplot(data = CARS2004, aes(x = total.cars, y = population)) +
+   geom_point() +
+   geom_smooth() +
+   theme_bw()
[Figure: scatterplot of population versus total.cars with a smooth curve.]
(e)
> mod.lm <- lm(population ~ total.cars, data = CARS2004)
> summary(mod.lm)

Call:
lm(formula = population ~ total.cars, data = CARS2004)

Residuals:
   Min     1Q Median     3Q    Max
 -7500  -1840  -1013   1015  13510

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.124e+03  9.731e+02   2.183   0.0395 *
total.cars  1.881e+00  6.561e-02  28.668

> ggplot(data = CARS2004, aes(x = total.cars, y = population)) +
+     geom_point() +
+     geom_smooth(method = "lm") +
+     theme_bw()
> POP <- predict(mod.lm, newdata = data.frame(total.cars = 19224.630)) * 1000
> POP
       1
38285550
> resid(mod.lm)[7]*1000
      7
4059450
[Figure: scatterplot of population versus total.cars with the least squares line superimposed.]
The least squares model predicts a population of approximately 38,285,550 people. Spain has a total.cars value of 19224.630 and a reported population of 42,345,000. The difference between Spain's actual population and the value predicted with least squares is the seventh residual, 42,345,000 − 38,285,550 = 4,059,450.
(f) There is a decreasing monotonic relationship between total.cars and death.rate.
> ggplot(data = CARS2004, aes(x = death.rate, y = total.cars)) +
+     geom_point() +
+     theme_bw()
[Figure: scatterplot of total.cars versus death.rate.]
(g) Spearman’s rank correlation is a measure of the monotonic relationship between two variables.
> with(data = CARS2004, cor(total.cars, death.rate, method = "spearman"))
[1] -0.9676923
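To make concrete what the coefficient measures, the following sketch uses made-up vectors `x` and `y` (hypothetical values, not the CARS2004 data). It checks that Spearman's coefficient is Pearson's correlation computed on the ranks, so a perfectly monotonic but nonlinear relationship still yields −1 while the ordinary correlation does not.

```r
# Hypothetical data: y decreases monotonically, but not linearly, in x
x <- c(1, 2, 4, 7, 11, 16)
y <- 100 / x^2

spearman <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y))   # Pearson correlation of the ranks
pearson <- cor(x, y)                        # ordinary linear correlation

all.equal(spearman, pearson_on_ranks)       # the two definitions agree
c(spearman = spearman, pearson = pearson)
```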
(h) The relationship is strong, negative, and linear between the logarithm of total.cars and the logarithm of death.rate.
> ggplot(data = CARS2004, aes(x = log(death.rate), y = log(total.cars))) +
+     geom_point() +
+     theme_bw()
[Figure: scatterplot of log(total.cars) versus log(death.rate).]
(i) The total number of cars predicted for a country with a log(death.rate) value of -3.769252 is 4231.018, expressed in the same units as total.cars.
> ggplot(data = CARS2004, aes(x = log(death.rate), y = log(total.cars))) +
+     geom_point() +
+     theme_bw() +
+     geom_smooth(method = "lm")
> modlm.log <- lm(log(total.cars) ~ log(death.rate), data = CARS2004)
> coef(summary(modlm.log))
                  Estimate Std. Error   t value     Pr(>|t|)
(Intercept)      5.0206666 0.19568324  25.65711 1.994256e-18
log(death.rate) -0.8833401 0.05142204 -17.17824 1.293676e-14
> TOTCARS <- exp(predict(modlm.log,
+                        newdata = data.frame(death.rate = exp(-3.769252))))
> TOTCARS
       1
4231.018
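The back-transformation in part (i) can be checked by hand from the printed coefficient table alone. The sketch below copies the two estimates from that table (the variable names `b0`, `b1`, and `total_cars` are illustrative, not from the original solution), evaluates the fitted line on the log scale, and exponentiates to return to the units of total.cars.

```r
# Coefficients copied from the printed coef(summary(modlm.log)) table
b0 <- 5.0206666     # intercept
b1 <- -0.8833401    # slope for log(death.rate)

log_death_rate <- -3.769252
log_total_cars <- b0 + b1 * log_death_rate   # prediction on the log scale
total_cars <- exp(log_total_cars)            # back to the units of total.cars
total_cars
```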
[Figure: scatterplot of log(total.cars) versus log(death.rate) with the least squares line superimposed.]
10. The data frame SURFACESPAIN in the PASWR2 package contains the surface area (km²) for seventeen autonomous Spanish communities.
(a) Use the function merge() to combine the data frames WHEATSPAIN (from Problem 3) and SURFACESPAIN into a new data frame named DataSpain.
(b) Create a variable named surface.h containing the surface area of each autonomous community in hectares. (Note: 100 hectares = 1 km².) Create a variable named wheat.p containing the percent surface area in each autonomous community dedicated to growing wheat. Add the newly created variables to the data frame DataSpain and store the result as a data frame with the name DataSpain.m.
(c) Assign the names of the autonomous communities as row names for DataSpain.m and remove the variable community from the data frame.
(d) Create a barplot showing the percent surface area dedicated to growing wheat for each of the seventeen Spanish autonomous communities. Arrange the communities by decreasing percentages.
(e) Display the percent surface area dedicated to growing wheat for each of the seventeen Spanish autonomous communities using the function dotchart(). To read about dotchart(), type ?dotchart at the command prompt. Do you prefer the barchart or the dotchart? Explain your answer.
(f) Describe the relationship between the surface area in an autonomous community dedicated to growing wheat (hectares) and the total surface area of the autonomous community (surface.h).
(g) Describe the relationship between the surface area in an autonomous community dedicated to growing wheat (hectares) and the percent of surface area dedicated to growing wheat out of the community's total surface area (wheat.p).
(h) Develop a model to predict the surface area in an autonomous community dedicated to growing wheat (hectares) based on the total surface area of the autonomous community (surface.h).
Solution:
(a)
> DataSpain <- merge(WHEATSPAIN, SURFACESPAIN)
> head(DataSpain)
     community hectares     acres               cuts surface
1    Andalucia   558292 1379569.6 (3.6e+05,1.55e+06]   87268
2       Aragon   311479  769681.4 (3.6e+05,1.55e+06]   47719
3     Asturias       65     160.6          (0,1e+05]   10604
4     Baleares     7203   17799.0          (0,1e+05]    4992
5 C.Valenciana     6111   15100.6          (0,1e+05]   23255
6     Canarias      100     247.1          (0,1e+05]    7447
(b)
> DataSpain.m <- within(data = DataSpain, {
+     surface.h <- surface * 100
+     wheat.p <- hectares / surface.h * 100
+ })
> DataSpain.m[1:6, c(1,2,3,6,7)]
     community hectares     acres     wheat.p surface.h
1    Andalucia   558292 1379569.6 6.397442361   8726800
2       Aragon   311479  769681.4 6.527358075   4771900
3     Asturias       65     160.6 0.006129762   1060400
4     Baleares     7203   17799.0 1.442908654    499200
5 C.Valenciana     6111   15100.6 0.262782197   2325500
6     Canarias      100     247.1 0.013428226    744700
(c)
(e)
> with(data = DataSpain.m[order(DataSpain.m$wheat.p), ],
+      dotchart(wheat.p, pch = 19,
+               labels = row.names(DataSpain.m[order(DataSpain.m$wheat.p), ]),
+               xlab = "Percent of Community Surface Area Dedicated to Growing Wheat"))
[Figure: dotcharts of wheat.p by community, ordered from highest to lowest: La Rioja, Castilla-Leon, Aragon, Andalucia, Navarra, P.Vasco, Extremadura, Castilla-La Mancha, Cataluna, Madrid, Baleares, Murcia, Galicia, C.Valenciana, Cantabria, Canarias, Asturias; x-axis: Percent of Community Surface Area Dedicated to Growing Wheat.]
(f) There is a positive linear relationship between surface.h and hectares.
> ggplot(data = DataSpain.m, aes(x = surface.h, y = hectares)) +
+     geom_point() +
+     labs(y = "Surface Area Dedicated to Growing Wheat",
+          x = "Total Surface Area of Community") +
+     theme_bw()
[Figure: scatterplot of Surface Area Dedicated to Growing Wheat versus Total Surface Area of Community.]
(g) There is a weak positive association between wheat.p (percent of surface area dedicated to growing wheat) and hectares (wheat growing area) that spreads out in a funnel fashion as the percent of surface area dedicated to growing wheat increases.
> ggplot(data = DataSpain.m, aes(x = wheat.p, y = hectares)) +
+     geom_point() +
+     labs(y = "Surface Area Dedicated to Growing Wheat",
+          x = "Percent Surface Area Dedicated to Growing Wheat") +
+     theme_bw()
[Figure: scatterplot of Surface Area Dedicated to Growing Wheat versus Percent Surface Area Dedicated to Growing Wheat.]
(h) The least squares line from regressing hectares on surface.h is $\widehat{\text{hectares}} = -52526.918 + 0.0602 \times \text{surface.h}$.
> model <- lm(hectares ~ surface.h, data = DataSpain.m)
> coef(summary(model))
                 Estimate   Std. Error   t value     Pr(>|t|)
(Intercept) -5.252692e+04 2.597808e+04 -2.021971 6.139094e-02
surface.h    6.021268e-02 6.198023e-03  9.714820 7.301776e-08
Chapter 3 General Probability and Random Variables
1. How many ways can a host randomly choose 8 people out of 90 in the audience to participate in a TV game show?
Solution:
> choose(90, 8)
[1] 77515521435
There are $\binom{90}{8} = 77{,}515{,}521{,}435$ ways to choose 8 people from 90.
2. How many different six-place license plates are possible if the first two places are letters and the remaining places are numbers?
Solution:
> 26 * 26 * 10 * 10 * 10 * 10
[1] 6760000
There are a total of 6,760,000 possible license plates.
3. How many different six-place license plates are possible (first two places letters, remaining places numbers) if repetition among letters and numbers is not permissible?
Solution:
> 26 * 25 * 10 * 9 * 8 * 7
[1] 3276000
With two letter places and four number places, there are a total of 3,276,000 possible license plates if repetition among letters and numbers is not permissible.
4. Susie has 25 books she would like to arrange on her desk. Of the 25 books, 7 are statistics books, 6 are biology books, 5 are English books, 4 are history books, and 3 are psychology books. If Susie arranges her books by subject, how many ways can she arrange her books?
Solution:
> factorial(5) * factorial(7) * factorial(6) * factorial(5) *
+     factorial(4) * factorial(3)
[1] 7.52468e+12
Susie can arrange her books 7,524,679,680,000 different ways.
5. A hat contains 20 consecutive numbers (1 to 20). If four numbers are drawn at random, how many ways are there for the largest number to be a 16 and the smallest number to be a 5?
Solution:
> choose(10, 2)
[1] 45
There is one way for the smallest number to be a 5 and the largest number to be a 16. This leaves two numbers to be drawn between 6 and 15. There are 10 numbers between 6 and 15 inclusive, so the remaining two numbers can be selected in $\binom{10}{2} = 45$ different ways.
6. A university committee of size 10, consisting of 2 faculty from the college of fine and applied arts, 2 faculty from the college of business, 3 faculty from the college of arts and sciences, and 3 administrators, is to be selected from 6 fine and applied arts faculty, 7 college of business faculty, 10 college of arts and sciences faculty, and 5 administrators. How many committees are possible?
Solution:
> TN <- choose(6, 2) * choose(7, 2) * choose(10, 3) * choose(5, 3)
> TN
[1] 378000
There are a total of 378,000 possible committees.
7. How many different letter arrangements can be made from the letters BIOLOGY, PROBABILITY, and STATISTICS, respectively?
Solution:
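One way to count the arrangements asked for in Problem 7 is the multinomial rule: n! divided by the product of the factorials of each letter's multiplicity. The sketch below is an illustration under that rule, not the authors' code; the helper name `arrangements` is hypothetical.

```r
# Distinct arrangements of the letters of a word:
# n! / (product of factorials of the letter multiplicities)
arrangements <- function(word) {
  letters_in_word <- strsplit(word, "")[[1]]
  counts <- as.vector(table(letters_in_word))   # multiplicity of each letter
  factorial(length(letters_in_word)) / prod(factorial(counts))
}

words <- c("BIOLOGY", "PROBABILITY", "STATISTICS")
result <- sapply(words, arrangements)
result
```

Under this rule BIOLOGY (one repeated O) gives 7!/2!, PROBABILITY (two B's, two I's) gives 11!/(2!2!), and STATISTICS (three S's, three T's, two I's) gives 10!/(3!3!2!).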
8. Solution:
> choose(12, 4) * choose(8, 5)
[1] 27720
The 12 rooms can be painted in a total of 27,720 possible ways.
9. A shipment of 50 laptops includes 3 that are defective. If an instructor purchases 4 laptops from the shipment to use in his class, how many ways are there for the instructor to purchase at least 2 of the defective laptops?
Solution:
> choose(3, 2)*choose(47, 2) + choose(3, 3)*choose(47, 1)
[1] 3290
10. A multiple-choice test consists of 10 questions. Each question has 5 answers (only one is correct). How many different ways can a student fill out the test? Solution: > 5^10 [1] 9765625
11. How many ways can five politicians stand in line? In how many ways can they stand in line if two of the politicians refuse to stand next to each other? Solution: > factorial(5) [1] 120 > factorial(5) - 2 * factorial(4) [1] 72 There are 120 ways five politicians can stand in line. If two of the politicians refuse to stand next to each other, there are 72 ways they may stand in line.
12. There are five different colored jerseys worn throughout the Tour de France. The yellow jersey is worn by the rider with the least accumulated time; the green jersey is worn by the best sprinter; the red and white polka dot jersey is worn by the best climber. The white jersey is worn by the best youngest rider, and the red jersey is worn by the rider with the most accumulated time still in the race. If 150 riders finish the Tour, how many different ways can the yellow, green, and red and white polka dot jerseys be awarded if (a) a rider can receive any number of jerseys and (b) each rider can receive at most one jersey?
Solution:
> 150^3          # part a
[1] 3375000
> 150*149*148    # part b
[1] 3307800
13. A president, treasurer, and secretary, all different, are to be chosen from among the 10 active members of a university club. How many different choices are possible if (a) There are no restrictions. (b) A will serve only if she is the treasurer. (c) B and C will not serve together. (d) D and E will serve together or not at all. (e) F must be an officer. Solution: (a) > 10 * 9 * 8 [1] 720 (b) > 9 * 8 * 7 + 9 * 8 [1] 576 (c) > 8 * 7 * 6 + 3 * 2 * 8 * 7 [1] 672 (d)
> 8 * 7 * 6 + 3 * 2 * 8 [1] 384 (e) > 3 * 9 * 8 [1] 216
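The five counts in Problem 13 can be double-checked by brute force. The sketch below is an illustrative check, not part of the original solution: it labels the ten members A through J (an assumption for the example), lists every assignment of three distinct members to the three offices, and counts the assignments that satisfy each restriction.

```r
members <- LETTERS[1:10]   # hypothetical labels A..J for the ten club members

# All assignments of three distinct members to the three offices
slate <- expand.grid(president = members, treasurer = members,
                     secretary = members, stringsAsFactors = FALSE)
slate <- slate[slate$president != slate$treasurer &
               slate$president != slate$secretary &
               slate$treasurer != slate$secretary, ]

# TRUE for the slates in which member m holds some office
has <- function(m) {
  slate$president == m | slate$treasurer == m | slate$secretary == m
}

counts <- c(
  a = nrow(slate),                                         # no restrictions
  b = sum(slate$president != "A" & slate$secretary != "A"),# A only as treasurer
  c = sum(!(has("B") & has("C"))),                         # B and C not together
  d = sum(has("D") == has("E")),                           # D and E both or neither
  e = sum(has("F"))                                        # F must be an officer
)
counts
```

The enumeration agrees with the closed-form products above: 720, 576, 672, 384, and 216.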
14. On a multiple-choice exam with three possible answers for each of the five questions, what is the probability that a student would get four or more correct answers just by guessing? Solution: > choose(5, 4)*(1/3)^4*(2/3)^1 + choose(5, 5)*(1/3)^5*(2/3)^0 [1] 0.04526749
15. Suppose four balls are chosen at random without replacement from an urn containing six black balls and four red balls. What is the probability of selecting two balls of each color? Solution: > choose(4, 2) * choose(6, 2)/choose(10, 4) [1] 0.4285714
16. What is the probability that a hand of five cards chosen randomly and without replacement from a standard deck of 52 cards contains the ace of hearts, exactly one other ace, and exactly two kings? Solution: > choose(1, 1)*choose(3, 1)*choose(4, 2)*choose(44, 1)/choose(52, 5) [1] 0.0003047373
17. In the New York State lottery game, six of the numbers 1 through 54 are chosen by a customer. Then, in a televised drawing, six of these numbers are selected. If all six of a customer’s numbers are selected, then that customer wins a share of the first prize. If five or four of the numbers are selected, the customer wins a share of the second or the third prize. What is the probability that any customer will win a share of the first prize, the second prize, and the third prize, respectively? Solution:
> choose(6, 6)/choose(54, 6)                  # Pr(first prize)
[1] 3.871892e-08
> choose(6, 5)*choose(48, 1)/choose(54, 6)    # Pr(second prize)
[1] 1.115105e-05
> choose(6, 4)*choose(48, 2)/choose(54, 6)    # Pr(third prize)
[1] 0.0006551242
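The three lottery probabilities are hypergeometric: of the 54 numbers, 6 are the customer's and 48 are not, and 6 are drawn. As a cross-check (a sketch, not the authors' code), base R's `dhyper(x, m, n, k)` gives the probability of `x` matches when `k` numbers are drawn from `m` chosen and `n` other numbers.

```r
# Number of the customer's six numbers matched for each prize
matches <- c(first = 6, second = 5, third = 4)

# m = 6 numbers picked by the customer, n = 48 others, k = 6 numbers drawn
prizes <- dhyper(matches, m = 6, n = 48, k = 6)

# The same probabilities from the counting argument used in the solution
by_counting <- c(choose(6, 6) * choose(48, 0),
                 choose(6, 5) * choose(48, 1),
                 choose(6, 4) * choose(48, 2)) / choose(54, 6)

cbind(dhyper = prizes, counting = by_counting)
```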
18. An office supply store is selling packages of 100 CDs at a very affordable price. However, roughly 10% of all packages are defective. If a package of 100 CDs containing exactly 10 defective CDs is purchased, find the probability that exactly 2 of the first 5 CDs used are defective. Solution: > (choose(10, 2) * choose(90, 3))/choose(100, 5) [1] 0.07021881
19. A box contains six marbles, two of which are black. Three are drawn with replacement. What is the probability two of the three are black? Solution: > fractions(choose(3, 2)*(1/3)^2*(2/3)^1) [1] 2/9
20. The ASU triathlon club consists of 11 women and 7 men. What is the probability of selecting a committee of size four with exactly three women? Solution: > choose(11, 3)*choose(7, 1)/choose(18, 4) [1] 0.377451 > # or > (11/18)*(10/17)*(9/16)*(7/15)*4 [1] 0.377451
21. Four golf balls are to be placed in six different containers. One ball is red; one, green; one, blue; and one, yellow.
(a) In how many ways can the four golf balls be placed into six different containers? Assume that any container can contain any number of golf balls (as long as there are a total of four golf balls).
(b) In how many ways can the golf balls be placed if container one remains empty?
(c) In how many ways can the golf balls be placed if no two golf balls can go into the same container?
(d) What is the probability that no two golf balls are in the same container, assuming that the balls are randomly tossed into the containers?
Solution:
(a) $6^4 = 1296$
(b) $5^4 = 625$
(c) $6 \cdot 5 \cdot 4 \cdot 3 = 360$
(d) $\dfrac{6 \cdot 5 \cdot 4 \cdot 3}{6^4} = \dfrac{360}{1296} = \dfrac{5}{18}$
22. Three dice are thrown. What fraction of the time does a sum of 9 appear on the faces? What percent of the time does a sum of 10 appear?
Solution:
> SS <- expand.grid(die1 = 1:6, die2 = 1:6, die3 = 1:6)
> SUM <- apply(SS, 1, sum)
> fractions(mean(SUM == 9))
[1] 25/216
> mean(SUM == 10)*100
[1] 12.5
A sum of 9 appears 25/216 (about 11.57%) of the time. A sum of 10 appears 12.5% of the time.
23. Assume that P(A) = 0.5, P(A ∩ C) = 0.2, P(C) = 0.4, P(B) = 0.4, P(A ∩ B ∩ C) = 0.1, P(B ∩ C) = 0.2, and P(A ∩ B) = 0.2. Calculate the following probabilities:
(a) $P(A \cup B \cup C)$
(b) $P(A^c \cap (B \cup C))$
(c) $P((B \cap C)^c \cup (A \cap B)^c)$
(d) $P(A) - P(A \cap C)$
Solution: (a) 0.8, (b) 0.3, (c) 0.9, (d) 0.3
24. In a 10k race where three runners, Susie, Mike, and Anna, enter the race with identical personal best times, assume they all have an equal chance of winning today's 10k. Consider the events:
• E1: Susie wins the 10k.
• E2: Susie places second in the 10k.
• E3: Susie places third in the 10k.
• W: Susie places higher than Mike.
Is W independent of E1, E2, and E3?
Solution: The sample space of outcomes is Ω = {SAM, SMA, AMS, ASM, MSA, MAS}, where each person in each place is represented by their name's first initial. Since
\[ P(W) = \frac{3}{6} = \frac{1}{2}, \quad P(E_1) = \frac{2}{6} = \frac{1}{3}, \quad P(E_2) = \frac{2}{6} = \frac{1}{3}, \quad P(E_3) = \frac{2}{6} = \frac{1}{3}, \]
and
\[ P(W|E_1) = \frac{P(W \cap E_1)}{P(E_1)} = \frac{1/3}{1/3} = 1, \quad P(W|E_2) = \frac{P(W \cap E_2)}{P(E_2)} = \frac{1/6}{1/3} = \frac{1}{2}, \quad P(W|E_3) = \frac{P(W \cap E_3)}{P(E_3)} = \frac{0}{1/3} = 0, \]
we can say W is independent of E2 but not of E1 or of E3, because only $P(W|E_2)$ equals $P(W) = 1/2$.
25. Verify that P(F|E) satisfies the three axioms of probability.
Solution: It must be shown that $P(F|E) = \dfrac{P(F \cap E)}{P(E)}$ satisfies
(1) $0 \le P(F|E) \le 1$
(2) $P(\Omega|E) = 1$
(3) $P\left(\cup_{i=1}^{\infty} F_i \,|\, E\right) = \sum_{i=1}^{\infty} P(F_i|E)$ for mutually exclusive events $F_1, F_2, \ldots$
(1) The left inequality holds since all probabilities must be greater than or equal to zero. Since $F \cap E \subset E$, it follows that $P(F \cap E) \le P(E)$, so $P(F|E) = P(F \cap E)/P(E) \le 1$.
(2) $P(\Omega|E) = \dfrac{P(\Omega \cap E)}{P(E)} = \dfrac{P(E)}{P(E)} = 1$
(3) Because the events $F_i \cap E$ are also mutually exclusive,
\[ P\left(\cup_{i=1}^{\infty} F_i \,|\, E\right) = \frac{P\left(\cup_{i=1}^{\infty} (F_i \cap E)\right)}{P(E)} = \sum_{i=1}^{\infty} \frac{P(F_i \cap E)}{P(E)} = \sum_{i=1}^{\infty} P(F_i|E). \]
26. If A and B are independent events, show that $A^c$ and $B^c$ are also independent events.
Solution: If A and B are independent, then P(A ∩ B) = P(A)P(B).
\[
\begin{aligned}
P(A^c \cap B^c) &= P((A \cup B)^c) \\
&= 1 - P(A \cup B) \\
&= 1 - P(A) - P(B) + P(A \cap B) \\
&= 1 - P(A) - P(B) + P(A)P(B) \\
&= 1 - P(A)(1 - P(B)) - P(B) \\
&= 1 - P(B) - P(A)(1 - P(B)) \\
&= (1 - P(B))(1 - P(A)) \\
&= P(A^c)P(B^c)
\end{aligned}
\]
Consequently, $A^c$ and $B^c$ are also independent.
27. Let A and B be events where 0 < P(A) < 1 and 0 < P(B) < 1. Is $P(A|B) + P(A^c|B^c) = 1$ true when A and B are (a) mutually exclusive? (b) independent? If $P(A|B) + P(A^c|B^c) = 1$ is not true for either (a) or (b), provide a counterexample.
Solution:
(a) If A and B are mutually exclusive, $P(A|B) + P(A^c|B^c)$ need not equal 1. Counterexample: Consider rolling a fair die and let the event A be rolling an even number and the event B be rolling an odd number. Then, $P(A|B) + P(A^c|B^c) = 0 + 0 = 0 \ne 1$.
(b) If A and B are independent, then $P(A|B) + P(A^c|B^c) = 1$ because $P(A|B) + P(A^c|B^c) = P(A) + P(A^c) = 1$. Recall that if A and B are independent, $A^c$ and $B^c$ are also independent; so $P(A|B) = P(A)$ and $P(A^c|B^c) = P(A^c)$.
28. A family has three cars, all with electric windows. Car A's windows always work. Car B's windows work 30% of the time, and Car C's windows work 75% of the time. The family uses Car A 2/3 of the time; Car B, 2/9 of the time; and Car C, the remaining fraction.
(a) On a particularly hot day, when the family wants to roll the windows down, compute the probability the windows will work.
(b) If the electric windows work, find the probability the family is driving Car C.
Solution: Let A, B, and C be the events of using the cars A, B, and C respectively and let T be the event windows work properly.
(a)
\[ P(T) = P(T|A)P(A) + P(T|B)P(B) + P(T|C)P(C) = \frac{2}{3} \times 1 + \frac{2}{9} \times 0.3 + \frac{1}{9} \times 0.75 = 0.8167 \]
(b) Applying Bayes' formula,
\[ P(C|T) = \frac{P(T|C)P(C)}{P(T)} = \frac{\frac{1}{9} \times 0.75}{0.8167} = 0.102. \]
29. A new drug test being considered by the International Olympic Committee can detect the presence of a banned substance when it has been taken by the subject in the last 90 days 98% of the time. However, the test also registers a "false positive" in 2% of the population that has never taken the banned substance. If 2% of the athletes in question are taking the banned substance, what is the probability a person that has a positive drug test is actually taking the banned substance?
Solution: Let B = banned substance is present and + = a positive test.
\[ P(+|B) = 0.98, \quad P(+|B^c) = 0.02, \quad P(B) = 0.02, \quad P(B^c) = 0.98 \]
\[
\begin{aligned}
P(B|+) &= \frac{P(B \cap +)}{P(+)} \\
&= \frac{P(+|B) \cdot P(B)}{P(+|B) \cdot P(B) + P(+|B^c) \cdot P(B^c)} \\
&= \frac{0.98 \times 0.02}{0.98 \times 0.02 + 0.02 \times 0.98} = 0.50
\end{aligned}
\]
30. The products of an agricultural firm are delivered by four different transportation companies, A, B, C, and D. Company A transports 40% of the products; company B, 30%; company C, 20%; and, finally, company D, 10%. During transportation, 5%, 4%, 2%, and 1% of the products spoil with companies A, B, C, and D, respectively. If one product is randomly selected, (a) Obtain the probability that it is spoiled. (b) If the chosen product is spoiled, derive the probability that it has been transported by company A. Solution: Let A, B, C, and D be the events associated with a company transporting a product and S be the event that the product is spoiled. P(A) = .40, P(B) = .30, P(C) = .20, and P(D) = .10. Also, P(S|A) = .05, P(S|B) = .04, P(S|C) = .02, and P(S|D) = .01
(a)
\[
\begin{aligned}
P(S) &= P(A)P(S|A) + P(B)P(S|B) + P(C)P(S|C) + P(D)P(S|D) \\
&= (0.4)(0.05) + (0.3)(0.04) + (0.2)(0.02) + (0.1)(0.01) = 0.037
\end{aligned}
\]
(b)
\[ P(A|S) = \frac{P(A \cap S)}{P(S)} = \frac{P(S|A) \cdot P(A)}{P(S)} = \frac{(0.05)(0.4)}{0.037} = 0.5405 \]
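Problems 28 through 30 all follow the same two steps: the law of total probability for the denominator, then Bayes' formula. A minimal sketch of that pattern, applied to the numbers of Problem 30, is shown below; the function name `bayes_update` is an illustrative choice and not part of the original solutions.

```r
# Total probability and Bayes' formula for a partition of events.
# prior:      P(A), P(B), ... for the partition
# likelihood: P(S | A), P(S | B), ... for the event of interest S
bayes_update <- function(prior, likelihood) {
  joint <- prior * likelihood            # P(S and each partition event)
  total <- sum(joint)                    # P(S), law of total probability
  list(total = total, posterior = joint / total)
}

prior <- c(A = 0.40, B = 0.30, C = 0.20, D = 0.10)   # share transported
spoil <- c(A = 0.05, B = 0.04, C = 0.02, D = 0.01)   # P(spoiled | company)

res <- bayes_update(prior, spoil)
res$total                 # P(S)
round(res$posterior, 4)   # P(company | S)
```

The same call with `prior = c(0.02, 0.98)` and `likelihood = c(0.98, 0.02)` reproduces the 0.50 of Problem 29.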
31. Two lots of large glass beads are available (A and B). Lot A has four beads, two of which are chipped; and lot B has five beads, two of which are chipped. Two beads are chosen at random from lot A and passed to lot B. Then, one bead is randomly selected from lot B. Find:
(a) The probability that the selected bead is chipped.
(b) The probability that the two beads selected from lot A were not chipped if the bead selected from lot B is not chipped.
Solution:
(a) Let B1 represent the lot obtained from passing two chipped beads from A to B. Let B2 represent the lot obtained from passing one chipped bead and one non-chipped bead from A to B. Let B3 represent the lot obtained from passing two non-chipped beads from A to B. Let C be the event of selecting a chipped bead.
\[ P(B_1) = \frac{\binom{2}{2}\binom{2}{0}}{\binom{4}{2}} = \frac{1}{6}; \quad P(B_2) = \frac{\binom{2}{1}\binom{2}{1}}{\binom{4}{2}} = \frac{4}{6}; \quad P(B_3) = \frac{\binom{2}{0}\binom{2}{2}}{\binom{4}{2}} = \frac{1}{6} \]
\[
\begin{aligned}
P(C) &= P(C|B_1) \cdot P(B_1) + P(C|B_2) \cdot P(B_2) + P(C|B_3) \cdot P(B_3) \\
&= \frac{4}{7} \cdot \frac{1}{6} + \frac{3}{7} \cdot \frac{4}{6} + \frac{2}{7} \cdot \frac{1}{6} = \frac{18}{42} = \frac{3}{7}
\end{aligned}
\]
(b)
\[ P(B_3|C^c) = \frac{P(C^c|B_3) \cdot P(B_3)}{P(C^c)} = \frac{\frac{5}{7} \cdot \frac{1}{6}}{\frac{4}{7}} = \frac{5}{24} \]
32. A box contains 5 defective bulbs, 10 partially defective (they start to fail after 10 hours of use), and 25 perfect bulbs. If a bulb is tested and it does not fail immediately, find the probability that the bulb is perfect.
Solution: Let P = a perfect bulb and N = does not fail immediately.
\[ P(P|N) = \frac{P(P \cap N)}{P(N)} = \frac{25/40}{35/40} = \frac{5}{7} \]
33. A salesman in a department store receives household appliances from three suppliers: I, II, and III. From previous experience, the salesman knows that 2%, 1%, and 3% of the appliances from suppliers I, II, and III, respectively, are defective. The salesman sells 35% of the appliances from supplier I, 25% from supplier II, and 40% from supplier III. If an appliance randomly selected is defective, find the probability that it comes from supplier III.
Solution: Let I, II, and III be events associated with suppliers and D be the event associated with a defective appliance.
\[
\begin{aligned}
P(D) &= P(D|I) \cdot P(I) + P(D|II) \cdot P(II) + P(D|III) \cdot P(III) \\
&= 0.02 \times 0.35 + 0.01 \times 0.25 + 0.03 \times 0.40 = 0.0215
\end{aligned}
\]
\[ P(III|D) = \frac{P(D|III) \cdot P(III)}{P(D)} = \frac{0.03 \times 0.4}{0.0215} = 0.5581 \]
34. Last year, a new business purchased 25 tablets, 25 laptops, and 50 desktops, all with three year warranties. The probability a tablet has had warranty work is four times the probability a desktop has had warranty work. The probability a laptop has had warranty work is twice the probability a desktop has had warranty work. Given that 10 computers have used the warranty,
(a) If a computer is a laptop, what is the probability it has had warranty work?
(b) If a computer has had warranty work, what is the probability it was a laptop?
(c) If a computer has had warranty work, what is the probability it was a tablet?
Solution: Let T represent tablets, L represent laptops, D represent desktops, and W represent warranty use.
\[ P(T) = 0.25, \quad P(L) = 0.25, \quad P(D) = 0.50 \]
\[ P(W|D) = p, \quad P(W|L) = 2p, \quad P(W|T) = 4p, \quad P(W) = 0.10 \]
(a) By the theorem of total probability,
\[
\begin{aligned}
P(W) &= P(W|T)P(T) + P(W|L)P(L) + P(W|D)P(D) \\
0.10 &= 4p \cdot 0.25 + 2p \cdot 0.25 + p \cdot 0.50 \\
0.10 &= 2p \implies p = 0.05.
\end{aligned}
\]
The probability a computer has had warranty work given that it is a laptop is P(W|L) = 2p = 0.10.
(b) The probability a computer is a laptop given that it has had warranty work is
\[ P(L|W) = \frac{P(W|L)P(L)}{P(W)} = \frac{0.10 \cdot 0.25}{0.10} = 0.25. \]
(c) The probability a computer is a tablet given that it has had warranty work is
\[ P(T|W) = \frac{P(W|T)P(T)}{P(W)} = \frac{0.2 \cdot 0.25}{0.10} = 0.50. \]
35. An urn contains 14 balls; 6 of them are white, and the others are black. Another urn contains 9 balls; 3 are white, and 6 are black. A ball is drawn at random from the first urn and is placed in the second urn. Then, a ball is drawn at random from the second urn. If this ball is white, find the probability that the ball drawn from the first urn was black.
Solution: Let W1 = first draw is white and W2 = second draw is white. Since there are only two colors of balls, $W_1^c$ = first draw is black and $W_2^c$ = second draw is black.
\[
\begin{aligned}
P(W_1^c|W_2) &= \frac{P(W_1^c \cap W_2)}{P(W_2)} \\
&= \frac{P(W_2|W_1^c) \cdot P(W_1^c)}{P(W_2)} \\
&= \frac{P(W_2|W_1^c) \cdot P(W_1^c)}{P(W_2|W_1^c) \cdot P(W_1^c) + P(W_2|W_1) \cdot P(W_1)} \\
&= \frac{\frac{3}{10} \cdot \frac{8}{14}}{\frac{3}{10} \cdot \frac{8}{14} + \frac{4}{10} \cdot \frac{6}{14}} \\
&= \frac{\frac{24}{140}}{\frac{24}{140} + \frac{24}{140}} = \frac{1}{2}
\end{aligned}
\]
36. Previous to the launching of a new flavor of yogurt, a company has conducted taste tests with four new flavors: lemon, strawberry, peach, and cherry. It obtained the following probabilities of a successful launch: P(lemon) = 2/10, P(strawberry) = 3/10, P(peach) = 4/10, and P(cherry) = 5/10. Let X be the random variable "number of successful flavors launched." Obtain its probability mass function.
Solution: Let L = lemon, S = strawberry, P = peach, and C = cherry launched successfully.
\[
\begin{aligned}
P(X = 4) &= P(L \cap S \cap P \cap C) = \frac{2}{10} \cdot \frac{3}{10} \cdot \frac{4}{10} \cdot \frac{5}{10} = 0.012 \\
P(X = 3) &= P(L \cap S \cap P \cap C^c) + P(L \cap S \cap P^c \cap C) + P(L \cap S^c \cap P \cap C) + P(L^c \cap S \cap P \cap C) \\
&= \frac{2 \cdot 3 \cdot 4 \cdot 5 + 2 \cdot 3 \cdot 6 \cdot 5 + 2 \cdot 7 \cdot 4 \cdot 5 + 8 \cdot 3 \cdot 4 \cdot 5}{10^4} = 0.106 \\
P(X = 2) &= P(L \cap S \cap P^c \cap C^c) + P(L \cap S^c \cap P^c \cap C) + P(L^c \cap S^c \cap P \cap C) \\
&\quad + P(L^c \cap S \cap P^c \cap C) + P(L^c \cap S \cap P \cap C^c) + P(L \cap S^c \cap P \cap C^c) \\
&= \frac{2 \cdot 3 \cdot 6 \cdot 5 + 2 \cdot 7 \cdot 6 \cdot 5 + 8 \cdot 7 \cdot 4 \cdot 5 + 8 \cdot 3 \cdot 6 \cdot 5 + 8 \cdot 3 \cdot 4 \cdot 5 + 2 \cdot 7 \cdot 4 \cdot 5}{10^4} = 0.32 \\
P(X = 1) &= P(L \cap S^c \cap P^c \cap C^c) + P(L^c \cap S \cap P^c \cap C^c) + P(L^c \cap S^c \cap P \cap C^c) + P(L^c \cap S^c \cap P^c \cap C) \\
&= \frac{2 \cdot 7 \cdot 6 \cdot 5 + 8 \cdot 3 \cdot 6 \cdot 5 + 8 \cdot 7 \cdot 4 \cdot 5 + 8 \cdot 7 \cdot 6 \cdot 5}{10^4} = 0.394 \\
P(X = 0) &= P(L^c \cap S^c \cap P^c \cap C^c) = \frac{8 \cdot 7 \cdot 6 \cdot 5}{10^4} = 0.168
\end{aligned}
\]
x       0      1      2      3      4
p(x)    0.168  0.394  0.320  0.106  0.012
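The probability mass function above can be reproduced by enumerating all 2^4 success/failure patterns, which is a convenient check on the hand calculation. The sketch below assumes, as the solution does, that the four launches are independent; the object names are illustrative.

```r
# Success probabilities for the four flavors
p <- c(lemon = 0.2, strawberry = 0.3, peach = 0.4, cherry = 0.5)

# All 16 patterns: 1 = successful launch, 0 = unsuccessful
patterns <- as.matrix(expand.grid(rep(list(0:1), 4)))

# Probability of each pattern under independence
pattern_prob <- apply(patterns, 1, function(s) prod(ifelse(s == 1, p, 1 - p)))

# Add up pattern probabilities by the number of successes
pmf <- tapply(pattern_prob, rowSums(patterns), sum)
round(pmf, 3)
```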
37. John and Peter play a game with a coin such that P(head) = p. The game consists of tossing a coin twice. John wins if the same result is obtained in the two tosses, and Peter wins if the two results are different.
(a) At what value of p is neither of them favored by the game?
(b) If p is different from your answer in (a), who is favored?
Solution: $P(\text{John wins}) = p^2 + (1 - p)^2$ and $P(\text{Peter wins}) = 1 - P(\text{John wins})$.
(a) When P(John wins) = P(Peter wins) = 1/2, the game is fair.
\[
\begin{aligned}
P(\text{John wins}) = p^2 + (1 - p)^2 &= \frac{1}{2} \\
2p^2 + 2(1 - p)^2 &= 1 \\
4p^2 - 4p + 1 &= 0 \\
(2p - 1)^2 &= 0 \implies p = \frac{1}{2}
\end{aligned}
\]
If p = 1/2 both of them have the same probability of winning the game.
(b) Since $P(\text{John wins}) - \frac{1}{2} = \frac{(2p - 1)^2}{2} > 0$ for all $p \ne 1/2$, John is favored for any value of p other than the one found in (a).
38. A bank is going to place a security camera in the ceiling of a circular hall of radius r. What is the probability that the camera is placed nearer the center than the outside circumference if the camera is placed at random?
Solution: If the camera is to be placed nearer the center, it will be placed inside a circle of radius r/2 if the entire room has a radius of r. The likelihood the camera falls in this circle is
\[ \frac{\pi (r/2)^2}{\pi r^2} = \frac{1}{4}. \]
39. Let the random variable X be the sum of the numbers on two fair dice. Find an upper bound on P(|X − 7| ≥ 4) using Chebyshev's Inequality as well as the exact probability for P(|X − 7| ≥ 4).
Solution: The probability mass function is
x       2     3     4     5     6     7     8     9     10    11    12
p(x)    1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
which implies $\mu_X = \sum_x x \cdot p(x) = 7$ and $\sigma_X^2 = \sum_x (x - \mu_X)^2 \cdot p(x) = 5.83$. The bound given by Chebyshev's Inequality says
\[ P(|X - \mu_X| \ge k) \le \frac{\sigma_X^2}{k^2}, \quad \text{so} \quad P(|X - 7| \ge 4) \le \frac{5.83}{4^2} \implies P(|X - 7| \ge 4) \le 0.3645833. \]
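A short enumeration gives both the Chebyshev bound and the exact probability asked for in Problem 39. This is a sketch with illustrative object names, not the authors' code: it lists the 36 equally likely outcomes of two dice and works directly from them.

```r
# The 36 equally likely outcomes of two fair dice and their sums
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
x <- rolls$die1 + rolls$die2

mu <- mean(x)                  # E[X]; every outcome has probability 1/36
sigma2 <- mean((x - mu)^2)     # Var[X] (population variance, not var())

bound <- sigma2 / 4^2          # Chebyshev bound on P(|X - 7| >= 4)
exact <- mean(abs(x - mu) >= 4)  # exact probability: sums 2, 3, 11, 12

c(mean = mu, variance = sigma2, chebyshev_bound = bound, exact = exact)
```

The exact probability is 6/36 = 1/6, well below the Chebyshev bound of about 0.365, which illustrates how conservative the inequality can be.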
> > > >
dicerolls >
MX > > > > > > > > > > > > > >
par(pty = "s") x > >
library(MASS) SS >
x > > + + > >
set.seed(13) sims > > > > > > >
x > > > > > > > > > > > > > + > > > > > > > > > > >
opar >
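(b) $f(x) = kx^2$, $0 < x < 2$.
\[
\begin{aligned}
\int_0^2 kx^2 \, dx &\overset{\text{set}}{=} 1 \\
\left. \frac{kx^3}{3} \right|_0^2 &= 1 \\
k \cdot \frac{8}{3} &= 1 \\
k &= \frac{3}{8}
\end{aligned}
\]
So, $f(x) = \frac{3}{8}x^2$ and $F(x) = \int_0^x \frac{3}{8}y^2 \, dy = \frac{x^3}{8}$.
> k
[1] 0.375
[Figure: two panels plotted against x, for x between 0.0 and 2.0.]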
(c) $f(x) = k\sqrt{x}/2$, $0 < x < 1$.
\[
\begin{aligned}
\int_0^1 \frac{k\sqrt{x}}{2} \, dx &\overset{\text{set}}{=} 1 \\
\left. \frac{kx^{3/2}}{3} \right|_0^1 &= 1 \\
k \cdot \frac{1}{3} &= 1 \\
k &= 3
\end{aligned}
\]
So, $f(x) = \frac{3\sqrt{x}}{2}$ and $F(x) = \int_0^x \frac{3\sqrt{y}}{2} \, dy = x^{3/2}$.
> k
[1] 2.999999
[Figure: two panels plotted against x, for x between 0.0 and 1.0.]
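The printed values of k (0.375 and 2.999999) are consistent with finding the constant numerically. One way to do this, sketched here under the assumption that `integrate()` is used and with a hypothetical helper name `normalizing_constant`, is to integrate the unnormalized function and take the reciprocal.

```r
# k such that k * g(x) integrates to 1 over (lower, upper)
normalizing_constant <- function(g, lower, upper) {
  1 / integrate(g, lower, upper)$value
}

k_b <- normalizing_constant(function(x) x^2, 0, 2)          # part (b)
k_c <- normalizing_constant(function(x) sqrt(x) / 2, 0, 1)  # part (c)

# Check: the scaled function in part (b) is a valid density
f_b <- function(x) k_b * x^2
area_b <- integrate(f_b, 0, 2)$value

c(k_b = k_b, k_c = k_c, area_b = area_b)
```

Because `integrate()` is a numerical routine, the constant in part (c) may differ from 3 in the last few digits, as in the printed output.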
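51. Given the function
\[ f(x) = k, \quad -1 < x < 1 \]
of the random variable X, find the coefficient of skewness for the distribution.
Solution: First, find k:
\[ \int_{-1}^{1} k \, dx = \left. kx \right|_{-1}^{1} = k(1 - (-1)) = 2k \overset{\text{set}}{=} 1 \implies k = \frac{1}{2}. \]
Skewness is $\gamma_1 = \dfrac{E[(X - \mu)^3]}{\sigma^3}$, so the expected value and variance must be calculated, as must $E[X^3]$.
\[ \mu = E[X] = \int_{-1}^{1} \frac{x}{2} \, dx = \left. \frac{x^2}{4} \right|_{-1}^{1} = 0 \]
and
\[ \sigma^2 = \text{Var}[X] = \int_{-1}^{1} \frac{1}{2}(x - 0)^2 \, dx = \left. \frac{x^3}{6} \right|_{-1}^{1} = \frac{2}{6} = \frac{1}{3} \]
\[ E[X^3] = \int_{-1}^{1} \frac{x^3}{2} \, dx = \left. \frac{x^4}{8} \right|_{-1}^{1} = 0 \]
Since Var[X] > 0 and $E[(X - \mu)^3] = E[X^3] = 0$, the skewness is zero.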
f > > > >
SS > >
> integrate(f, 5, 8)$value     # P(X
> integrate(f, 6, 10)$value    # P(X >=6)
[1] 0.96
> integrate(f, 7, 8)$value     # P(7 < X < 8)
[1] 0.2
56. The number of bottles of milk that a dairy farm fills per day is a random variable with mean 5000 and standard deviation 100. Assume the farm always has a sufficient number of glass bottles to be used to store the milk. However, for a bottle of milk to be sent to a grocery store, it must be hermetically sealed with a metal cap that is produced on site. Calculate the minimum number of metal caps that must be produced on a daily basis so that all filled milk bottles can be shipped to grocery stores with a probability of at least 0.9. Solution: Let X = number of milk bottles filled per day. X ∼ (µ = 5000, σ = 100).
By Chebyshev's inequality,
\[ P(|X - \mu| < k) \ge 1 - \frac{\sigma^2}{k^2} \]
\[ P(|X - 5000| < k) \ge 1 - \frac{100^2}{k^2} \]
Since the probability requested is at least 0.9,
\[ 1 - \frac{100^2}{k^2} \ge 0.9 \implies 0.1k^2 \ge 100^2 \implies k^2 \ge 100{,}000 \implies k \ge \sqrt{100{,}000} = 316.2278 \implies k = 317 \]
If 5317 metal caps are produced on a daily basis, all filled milk bottles can be shipped to grocery stores with a probability of at least 0.9.
57. Define X as the space occupied by a certain device in a 1 m³ container. The probability density function of X is given by
\[ f(x) = \frac{630}{56}\,x^4\left(1 - x^4\right), \quad 0 < x < 1. \]
(a) Graph the probability density function.
(b) Calculate the mean of X by hand.
(c) Calculate the variance of X by hand.
(d) Calculate P(0.20 < X < 0.80) by hand.
(e) Calculate the mean of X using integrate().
(f) Calculate the variance of X using integrate().
(g) Calculate P(0.20 < X < 0.80) using integrate().
Solution:
(a)
> f <- function(x) 630/56 * x^4 * (1 - x^4)
> curve(f, 0, 1)
[Figure: graph of the probability density function f(x) for 0 ≤ x ≤ 1; horizontal axis x, vertical axis f(x).]
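(b)
\[
\begin{aligned}
E[X] &= \int_0^1 x \cdot \frac{630}{56}\,x^4\left(1 - x^4\right) dx
      = \frac{630}{56}\int_0^1 \left(x^5 - x^9\right) dx \\
     &= \frac{630}{56}\left[\frac{x^6}{6} - \frac{x^{10}}{10}\right]_0^1
      = \frac{630}{56}\left(\frac{1}{6} - \frac{1}{10} - 0\right)
      = \frac{630}{56}\cdot\frac{2}{30} = \frac{3}{4}
\end{aligned}
\]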
(c) First, find E[X²].
\[
\begin{aligned}
E[X^2] &= \int_0^1 x^2 \cdot \frac{630}{56}\,x^4\left(1 - x^4\right) dx
        = \frac{630}{56}\int_0^1 \left(x^6 - x^{10}\right) dx \\
       &= \frac{630}{56}\left[\frac{x^7}{7} - \frac{x^{11}}{11}\right]_0^1
        = \frac{630}{56}\left(\frac{1}{7} - \frac{1}{11} - 0\right)
        = \frac{630}{56}\cdot\frac{4}{77} = \frac{45}{77} = 0.5844
\end{aligned}
\]
\[ \mathrm{Var}[X] = E[X^2] - (E[X])^2 = \frac{45}{77} - \frac{9}{16} = \frac{27}{1232} = 0.0219 \]
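(d)
\[
\begin{aligned}
P(0.2 < X < 0.8) &= \int_{0.2}^{0.8} \frac{630}{56}\,x^4\left(1 - x^4\right) dx
  = \frac{630}{56}\left[\frac{x^5}{5} - \frac{x^9}{9}\right]_{0.2}^{0.8} \\
 &= \frac{630}{56}\left[\left(\frac{(0.8)^5}{5} - \frac{(0.8)^9}{9}\right) - \left(\frac{(0.2)^5}{5} - \frac{(0.2)^9}{9}\right)\right]
  = \frac{222{,}183}{390{,}625} = 0.5688
\end{aligned}
\]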
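Parts (e) through (g) ask for the same three quantities with integrate(). The following is a minimal sketch, assuming the density is coded as f below; the object names EX, EX2, VX, and P are illustrative and are not necessarily the names used in the manual's own code.

```r
# Density of X: f(x) = (630/56) x^4 (1 - x^4) on (0, 1)
f <- function(x) 630/56 * x^4 * (1 - x^4)

EX  <- integrate(function(x) x * f(x), 0, 1)$value     # (e) mean
EX2 <- integrate(function(x) x^2 * f(x), 0, 1)$value   # second moment
VX  <- EX2 - EX^2                                      # (f) variance
P   <- integrate(f, 0.2, 0.8)$value                    # (g) P(0.2 < X < 0.8)

c(mean = EX, variance = VX, prob = P)
```

The three values should agree with the hand calculations in (b), (c), and (d) up to the numerical tolerance of integrate().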
(e) > fxe EX EX [1] 0.75 (f) > > > >
fxf 0 f (x) = 50 0 x ≤ 0. Using R,
(a) Find the probability that a car stays more than 1 hour.
(b) Let Y = 0.5 + 0.03X be the cost in dollars that the mall has to pay a security service per parked car. Find the mean parking cost for 1000 cars.
(c) Find the variance and skewness coefficient of Y.
Solution:
(a) P(X > 60) = 0.3011942
> fx <- function(x) 1/50 * exp(-x/50)
> integrate(f = fx, lower = 60, upper = Inf)$value
[1] 0.3011942
(b) E[X] = 50, Y = 0.5 + 0.03X, E[Y] = 0.5 + 0.03E[X] = 2, and E[1000Y] = 1000E[Y] = 2000.
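For part (c), a hedged sketch follows. It assumes the parking time X has the exponential density with mean 50 that is consistent with the values P(X > 60) = 0.3011942 and E[X] = 50 quoted above; the object names are illustrative. Because Y = 0.5 + 0.03X is a linear function with positive slope, Var[Y] = 0.03² Var[X] and the skewness of Y equals the skewness of X.

```r
# Assumed density of X (exponential with mean 50 minutes)
fx <- function(x) 1/50 * exp(-x/50)

EX <- integrate(function(x) x * fx(x), 0, Inf)$value            # E[X]
VX <- integrate(function(x) (x - EX)^2 * fx(x), 0, Inf)$value   # Var[X]
M3 <- integrate(function(x) (x - EX)^3 * fx(x), 0, Inf)$value   # third central moment

VY    <- 0.03^2 * VX      # variance of Y = 0.5 + 0.03 X
skewY <- M3 / VX^(3/2)    # skewness is unchanged by a positive linear map
c(VarY = VY, SkewY = skewY)
```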
Me
VX LT LT [1] 0.8952425
63. The time, in hours, a child practices his musical instrument on Saturdays has pdf
\[ f(x) = \begin{cases} k(1 - x) & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
(a) Find k to make f(x) a valid pdf.
(b) Write the cdf and find the probability the child practices more than 48 minutes on a Saturday.
Solution:
(a) To be a pdf, the integral of f(x) must equal 1.
\[ 1 = \int_0^1 k(1 - x)\,dx = -\frac{k}{2}(1 - x)^2\Big|_0^1 = \frac{k}{2}(1 - 0)^2 = \frac{k}{2} \]
The integral of f(x) is 1 when k = 2.
(b) Since
\[ \int_0^x 2(1 - t)\,dt = -(1 - t)^2\Big|_0^x = 1 - (1 - x)^2 = x(2 - x), \]
it follows that
\[ F(x) = \begin{cases} 0 & x < 0 \\ x(2 - x) & 0 \le x \le 1 \\ 1 & x > 1. \end{cases} \]
P(X > 48/60) = 1 − P(X ≤ 0.8) = 1 − F(0.8) = 1 − 0.8(2 − 0.8) = 0.04.
> f <- function(x) 2 * (1 - x)
> integrate(f, 0.8, 1)$value
[1] 0.04
64. Consider the equilateral triangle ABC with side l. Given a randomly chosen point R in the triangle, calculate the cumulative and the probability density functions for the distance from R to the side BC. Construct a graph of the cumulative density function for different values of l. (Hint: The equation of the line CA is y = √3 x.)
[Figure: equilateral triangle ABC with side l, showing the point R, the segment MN, and the distance x to the base CB.]
Solution: Let X = the distance from R to side BC. Then
\[ P(X \le x) = \frac{\text{Area of } CMNB}{\text{Area of } ABC}. \]
\[ \text{Area of } CMNB = \frac{1}{2}\left[\text{length}(MN) + \text{length}(CB)\right]\cdot x = \frac{1}{2}\left[\left(l - \frac{2x}{\sqrt{3}}\right) + l\right]\cdot x = lx - \frac{x^2}{\sqrt{3}} \]
\[ \text{Area of } ABC = \frac{1}{2}\,l\cdot\frac{\sqrt{3}}{2}\,l = \frac{l^2\sqrt{3}}{4} \]
So,
\[ P(X \le x) = \frac{lx - \frac{x^2}{\sqrt{3}}}{\frac{l^2\sqrt{3}}{4}} = \frac{4}{l^2\sqrt{3}}\left(lx - \frac{x^2}{\sqrt{3}}\right) = \frac{4x\left(l\sqrt{3} - x\right)}{3l^2} = F_X(x) \]
Note 0 ≤ x ≤ l√3/2.
\[ f(x) = F_X'(x) = \frac{4l\sqrt{3} - 8x}{3l^2} \quad \text{when } 0 \le x \le \frac{l\sqrt{3}}{2} \]
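The requested graph can be drawn directly from the cdf derived above. This is a sketch only: the function name Fx, the line types, and the legend placement are illustrative choices rather than the manual's own plotting code.

```r
# cdf of the distance from R to side CB for an equilateral triangle of side l
Fx <- function(x, l) {
  h <- l * sqrt(3) / 2                       # height of the triangle
  ifelse(x < 0, 0,
         ifelse(x > h, 1, 4 * x * (l * sqrt(3) - x) / (3 * l^2)))
}

sides <- 1:5
# Draw the widest curve first so the x-axis covers every side length
curve(Fx(x, l = 5), from = 0, to = 5 * sqrt(3) / 2, lty = 5,
      xlab = "Distance to CB", ylab = "F(x)")
for (l in 1:4) curve(Fx(x, l = l), add = TRUE, lty = l)
legend("bottomright", legend = paste("Side Length (l) =", sides), lty = sides)
```

Each curve reaches 1 at the height of its triangle, l√3/2, and stays at 1 beyond it.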
[Figure: cumulative distribution curves for Side Length (l) = 1, 2, 3, 4, and 5; horizontal axis: Distance to CB.]
Chapter 4 Univariate Probability Distributions
1. Let X be a Poisson random variable with mean equal to 2. Find P(X = 0), P(X ≥ 3), and P(X ≤ k) ≥ 0.70.
Solution: P(X = 0) = 0.1353, P(X ≥ 3) = 0.3233, and P(X ≤ k) ≥ 0.70 =⇒ k = 3.
> dpois(0, 2)
[1] 0.1353353
> ppois(2, 2, lower = FALSE)
[1] 0.3233236
> qpois(0.7, 2)
[1] 3
2. Let X be an exponential random variable Exp(λ = 3). Find P(2 < X < 6).
Solution: P(2 < X < 6) = 0.0025
> pexp(6, 3) - pexp(2, 3)
[1] 0.002478737
3. Let X be a normal random variable N(µ = 7, σ = 3). Calculate P(X > 7.1). Find the value of k such that P(X < k) = 0.8.
Solution: P(X > 7.1) = 0.4867, and k = 9.5249 if P(X < k) = 0.8.
> pnorm(7.1, 7, 3, lower = FALSE)
[1] 0.4867044
> qnorm(0.80, 7, 3)
[1] 9.524864
4. Let X be a normal random variable N(µ = 3, σ = √0.5). Calculate P(X > 3.5).
Solution: P(X > 3.5) = 0.2398
> pnorm(3.5, 3, sqrt(0.5), lower = FALSE)
[1] 0.2397501
5. Let X be a gamma random variable Γ(α = 2, λ = 6). Find the value a such that P(X < a) = 0.95. Solution: a = 0.7906 if P(X < a) = 0.95
> qgamma(0.95, 2, 6) [1] 0.7906441
6. Construct a plot for the probability mass function and the cumulative probability distribution of a binomial random variable Bin(n = 8, π = 0.3). Find the smallest value of k such that P(X ≤ k) ≥ 0.44 when X ∼ Bin(n = 8, π = 0.7). Calculate P(Y ≥ 3) if Y ∼ Bin(20, 0.2). Solution:
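One way to approach this exercise is sketched below. The object names and plotting choices are illustrative assumptions, not the manual's own code: dbinom() and pbinom() supply the pmf and cdf, qbinom() returns the smallest k with P(X ≤ k) at least the requested probability, and the upper tail P(Y ≥ 3) is 1 − P(Y ≤ 2).

```r
x   <- 0:8
pmf <- dbinom(x, size = 8, prob = 0.3)   # P(X = x) for Bin(8, 0.3)
cdf <- pbinom(x, size = 8, prob = 0.3)   # P(X <= x) for Bin(8, 0.3)

op <- par(mfrow = c(1, 2))
plot(x, pmf, type = "h", xlab = "x", ylab = "P(X = x)", main = "PMF of Bin(8, 0.3)")
plot(stepfun(x, c(0, cdf)), verticals = FALSE, pch = 19,
     xlab = "x", ylab = "P(X <= x)", main = "CDF of Bin(8, 0.3)")
par(op)

k  <- qbinom(0.44, size = 8, prob = 0.7)    # smallest k with P(X <= k) >= 0.44
pY <- 1 - pbinom(2, size = 20, prob = 0.2)  # P(Y >= 3) for Bin(20, 0.2)
c(k = k, P_Y_ge_3 = pY)
```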
> > + + + > > + + +
DF1 sum(dbinom(42:54, 60, 0.80)) [1] 0.9658254
9. It is known that 3% of the seeds of a certain variety of tomato do not germinate. The seeds are sold in individual boxes that contain 20 seeds per box with the guarantee that at least 18 seeds will germinate. Find the probability that a randomly selected box does not fulfill the aforementioned requirement.
Solution: Let X = number of seeds that germinate. X ∼ Bin(n = 20, π = 0.97), P(X ≤ 17) = 0.021.
> pbinom(17, 20, 0.97)
[1] 0.02100836
10. A garage has two machines, A and B, to balance the wheels of a car. Suppose that 95% of the wheels are correctly balanced by machine A, while 85% of the wheels are correctly balanced by machine B. A machine is randomly selected to balance 20 wheels, and 3 of them are not properly balanced. What is the probability that machine A was used? What is the probability machine B was used?
Solution: Let A = machine A was used to balance the wheels and B = machine B was used to balance the wheels. Let E = the event that 3 out of 20 wheels are balanced improperly. If X = the number of wheels balanced improperly, then P(E|A) = P(X = 3) = 0.0596 where X ∼ Bin(20, 0.05), and P(E|B) = P(X = 3) = 0.2428 where X ∼ Bin(20, 0.15).
\[
\begin{aligned}
P(A|E) &= \frac{P(A \cap E)}{P(E)} = \frac{P(E|A)\cdot P(A)}{P(E|A)\cdot P(A) + P(E|B)\cdot P(B)} = \frac{0.0596 \times 0.5}{0.0596 \times 0.5 + 0.2428 \times 0.5} = 0.197 \\
P(B|E) &= \frac{P(B \cap E)}{P(E)} = \frac{P(E|B)\cdot P(B)}{P(E|A)\cdot P(A) + P(E|B)\cdot P(B)} = \frac{0.2428 \times 0.5}{0.0596 \times 0.5 + 0.2428 \times 0.5} = 0.803
\end{aligned}
\]
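The Bayes' rule calculation for the two balancing machines can be reproduced in a few lines. This is a sketch with illustrative object names (like, prior, post); it assumes, as the solution does, that each machine is selected with probability 0.5.

```r
# Likelihood of 3 improperly balanced wheels out of 20 under each machine
like  <- c(A = dbinom(3, 20, 0.05), B = dbinom(3, 20, 0.15))
prior <- c(A = 0.5, B = 0.5)               # machine chosen at random

post <- like * prior / sum(like * prior)   # Bayes' rule
round(post, 3)
```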
11. Traffic volume is an important factor for determining the most cost-effective method to surface a road. Suppose that the average number of vehicles passing a certain point on a road is 2 every 30 seconds.
(a) Find the probability that more than 3 cars will pass the point in 30 seconds.
(b) What is the probability that more than 10 cars pass the point in 3 minutes?
Solution:
(a) Let X = number of cars passing a certain point on the road in 30 seconds. X ∼ Pois(λ = 2). P(X > 3) = 1 − P(X ≤ 3) = 0.1429.
> 1 - ppois(3, 2)
[1] 0.1428765
(b) Let Y = number of cars passing a certain point on the road in 3 minutes. Y ∼ Pois(λ = 12). P(Y > 10) = 1 − P(Y ≤ 10) = 0.6528.
> 1 - ppois(10, 12)
[1] 0.6527706
12. The retaining wall of a dam will break if it is subjected to the pressure of two floods. If the average number of floods in a century is two, find the probability that the retaining wall lasts more than 20 years.
Solution: Let X = number of floods a retaining wall is subjected to each year. X ∼ Pois(λ = 2/100). Let W = waiting time until the αth flood. The probability the retaining wall lasts more than 20 years is P(W > 20) = 0.9384 where W ∼ Γ(2, 2/100).
> pgamma(20, 2, 2/100, lower = FALSE)
[1] 0.9384481
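The retaining wall answer can be cross-checked through the Poisson process itself: the wall survives 20 years exactly when at most one flood occurs in that time, and the number of floods in 20 years is Poisson with mean 20 × 2/100 = 0.4. A minimal sketch (object names are illustrative):

```r
rate <- 2/100   # floods per year
yrs  <- 20

p_gamma <- pgamma(yrs, shape = 2, rate = rate, lower.tail = FALSE)  # P(W > 20)
p_pois  <- ppois(1, lambda = rate * yrs)                            # P(at most 1 flood in 20 years)
c(gamma = p_gamma, poisson = p_pois)
```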
13. A particular competition shooter hits his targets 70% of the time with any pistol. To prepare for shooting competitions, this individual practices with a pistol that holds 5 bullets on Tuesday, Thursday, and Saturday, and a pistol that holds 7 bullets the other days. If he fires at targets until the pistol is empty, find the probability that he hits only one target out of the bullets shot in the first round of bullets in the pistol he is carrying that day. In this case, what is the probability that he used the pistol with 7 bullets?
Solution: If he fires at targets until the pistol is empty, the probability that he hits only one target out of the bullets shot in the first round of bullets in the pistol he is carrying that day is 0.0142. Let X = number of shots that hit the target. X|TRS ∼ Bin(5, 0.7) and X|MWFS ∼ Bin(7, 0.7), where TRS denotes the three days with the 5-bullet pistol and MWFS the four days with the 7-bullet pistol.
\[
\begin{aligned}
P(X = 1) &= P(X = 1|TRS)\,P(TRS) + P(X = 1|MWFS)\,P(MWFS) \\
         &= \binom{5}{1}(0.7)^1(0.3)^4\cdot\frac{3}{7} + \binom{7}{1}(0.7)^1(0.3)^6\cdot\frac{4}{7} = 0.0142
\end{aligned}
\]
> AnsA <- dbinom(1, 5, 0.7)*3/7 + dbinom(1, 7, 0.7)*4/7
> AnsA
[1] 0.0141912
The probability that he used the pistol with 7 bullets if he hits only one target is 0.1438.
\[ P(MWFS|X = 1) = \frac{P(X = 1|MWFS)\,P(MWFS)}{P(X = 1)} = \frac{\binom{7}{1}(0.7)^1(0.3)^6\cdot\frac{4}{7}}{0.0142} = 0.1438 \]
> AnsB <- dbinom(1, 7, 0.7)*4/7/AnsA
> AnsB
[1] 0.1438356
14. The lifetime of a certain engine follows a normal distribution with mean and standard deviation of 10 and 3.5 years, respectively. The manufacturer replaces all catastrophic engine failures within the guarantee period free of charge. If the manufacturer is willing to replace no more than 4% of the defective engines, what is the largest guarantee period the manufacturer should advertise?
Solution: Let EF ∼ N(10, 3.5). Given P(EF ≤ k) = 0.04 =⇒ k = 3.8726 years is the largest guarantee period the manufacturer should advertise.
> qnorm(0.04, 10, 3.5)
[1] 3.872599
15. Agronomists are developing an improved variety of green peppers. Supermarket managers have indicated customers are not likely to purchase green peppers weighing less than 45 grams. The current variety of green pepper plants produces green peppers that weigh 48 grams on average, but 13% weigh less than 45 grams. Assume the weight of the current variety of green peppers follows a normal distribution.
(a) What is the standard deviation of the weights of the current variety of green peppers?
(b) The agronomists want to reduce the frequency of green peppers weighing less than 45 grams to no more than 5%. One way to reduce the frequency of underweight green peppers is to increase the weight of the green peppers. If the standard deviation remains the same, what mean should the agronomists target as a goal?
(c) The agronomists produce a new variety of green peppers with a mean weight of 50 grams, which meets the 5% goal. What is the standard deviation of the weights of these new green peppers?
(d) Does the current variety or the new variety produce a green pepper with a more consistent weight?
Solution:
(a)
> z <- qnorm(0.13)
> z
[1] -1.126391
> sig <- (45 - 48)/z
> sig
[1] 2.663373
\[ z_{0.13} = \frac{x - \mu}{\sigma} \implies -1.1264 = \frac{45 - 48}{\sigma} \implies -1.1264\cdot\sigma = -3 \implies \sigma = \frac{-3}{-1.1264} = 2.6634 \]
The standard deviation for the current variety of green peppers is 2.6634 grams.
(b)
> z <- qnorm(0.05)
> z
[1] -1.644854
> mu <- 45 - z*sig
> mu
[1] 49.38086
\[ z_{0.05} = \frac{x - \mu}{\sigma} \implies -1.6449 = \frac{45 - \mu}{2.6634} \implies -1.6449\cdot 2.6634 = -4.3809 = 45 - \mu \implies \mu = 45 - (-4.3809) = 49.3809 \]
The agronomists should attempt to create green peppers with a mean weight of 49.3809 grams.
(c)
> z <- qnorm(0.05)
> sig2 <- (45 - 50)/z
> sig2
[1] 3.039784
\[ z_{0.05} = \frac{x - \mu}{\sigma} \implies -1.6449 = \frac{45 - 50}{\sigma} \implies \sigma = \frac{45 - 50}{-1.6449} = 3.0398 \]
The standard deviation for the new variety of green peppers is 3.0398 grams.
(d) Since the standard deviation of the new variety (3.0398 grams) is greater than the standard deviation of the current variety (2.6634 grams), the new variety is less consistent with respect to weight than the current variety.
16. Given independent random variables Y1, Y2, X, W, Z1, Z2, and Z3,
(a) Compute P((Y2 ≥ 3) ∪ (Y1 < 9)) if Y1 ∼ Bin(n = 10, π = 0.3) and Y2 ∼ Bin(n = 5, π = 0.1).
(b) Compute P(X ≥ 2|X < 6) if X ∼ Pois(λ = 4).
(c) If W ∼ N(µ, σ), find the value of k that satisfies the equation P(µ < W < µ + 2kσ) = 0.45.
(d) If Zi ∼ N(0, 1) for i = 1, 2, 3, compute \(P\left(\sqrt{Z_1^2 + Z_2^2 + Z_3^2} > 1.5\right)\).
Solution:
(d) \[ P\left(\sqrt{Z_1^2 + Z_2^2 + Z_3^2} > 1.5\right) = P\left(\chi^2_3 > (1.5)^2\right) = 0.5222 \]
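The four parts of Exercise 16 can be computed as sketched below. The object names are illustrative and this is not necessarily how the manual's own code is organized. Part (a) uses the stated independence of Y1 and Y2, part (b) uses the definition of conditional probability, part (c) uses the fact that P(µ < W < µ + 2kσ) = Φ(2k) − 0.5, and part (d) uses the chi-square distribution of a sum of three squared standard normals.

```r
# (a) P(A or B) = P(A) + P(B) - P(A)P(B) for independent events
pA <- 1 - pbinom(2, 5, 0.1)      # P(Y2 >= 3)
pB <- pbinom(8, 10, 0.3)         # P(Y1 < 9)
ans_a <- pA + pB - pA * pB

# (b) P(X >= 2 | X < 6) = P(2 <= X <= 5) / P(X <= 5)
ans_b <- (ppois(5, 4) - ppois(1, 4)) / ppois(5, 4)

# (c) pnorm(2k) - 0.5 = 0.45, so 2k is the 0.95 standard normal quantile
ans_c <- qnorm(0.95) / 2

# (d) Z1^2 + Z2^2 + Z3^2 follows a chi-square distribution with 3 df
ans_d <- pchisq(1.5^2, df = 3, lower.tail = FALSE)

c(a = ans_a, b = ans_b, c = ans_c, d = ans_d)
```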
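17. Derive the mean and variance for the discrete uniform distribution. (Hints: \(\sum_{i=1}^{n} x_i = \frac{n(n+1)}{2}\); \(\sum_{i=1}^{n} x_i^2 = \frac{n(n+1)(2n+1)}{6}\), when \(x_i = 1, 2, \ldots, n\).)
Solution:
X is a discrete uniform, which means it takes on values in {1, 2, 3, . . . , n} with probability 1/n each.
\[ E[X] = \sum_{i=1}^{n} x_i \cdot P(X = x_i) = \sum_{i=1}^{n} i\cdot\frac{1}{n} = \frac{1}{n}\sum_{i=1}^{n} i = \frac{1}{n}\cdot\frac{n(n+1)}{2} = \frac{n+1}{2} \]
\[ E[X^2] = \sum_{i=1}^{n} x_i^2 \cdot P(X = x_i) = \sum_{i=1}^{n} i^2\cdot\frac{1}{n} = \frac{1}{n}\sum_{i=1}^{n} i^2 = \frac{1}{n}\cdot\frac{n(n+1)(2n+1)}{6} = \frac{(n+1)(2n+1)}{6} \]
\[
\begin{aligned}
\mathrm{Var}[X] = E[X^2] - (E[X])^2 &= \frac{(n+1)(2n+1)}{6} - \left(\frac{n+1}{2}\right)^2 \\
 &= \frac{2n^2 + 3n + 1}{6} - \frac{n^2 + 2n + 1}{4} \\
 &= \frac{4n^2 + 6n + 2 - 3n^2 - 6n - 3}{12} \\
 &= \frac{n^2 - 1}{12}
\end{aligned}
\]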
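The two formulas for the discrete uniform distribution are easy to spot-check numerically. In the sketch below the value n = 10 is an arbitrary illustration; any positive integer should give the same agreement.

```r
n <- 10                   # illustrative support size
x <- 1:n
p <- rep(1/n, n)          # equal probabilities

EX <- sum(x * p)          # mean from the definition
VX <- sum(x^2 * p) - EX^2 # variance from the definition

c(EX = EX, formula_mean = (n + 1)/2, VX = VX, formula_var = (n^2 - 1)/12)
```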
18. Suppose the percentages of drinks sold from a vending machine are 80% and 20% for soft drinks and bottled water, respectively.
(a) What is the probability that on a randomly selected day, the first soft drink is the fourth drink sold?
(b) Find the probability that exactly 1 out of 10 drinks sold is a soft drink.
(c) Find the probability that the fifth soft drink is the seventh drink sold.
(d) Verify empirically that P(Bin(n, π) ≤ r − 1) = 1 − P(NB(r, π) ≤ (n − r)), with n = 10, π = 0.8, and r = 4.
Solution: Let X = number of waters (failures) purchased before the first soft drink is purchased. Then, X ∼ Geo(0.80).
(a) P(X = 3) = 0.0064 or use X ∼ NB(1, 0.80).
> dgeom(3, 0.80)
[1] 0.0064
> dnbinom(3, 1, 0.80)
[1] 0.0064
(b) Let X = number of soft drinks sold. Then, X ∼ Bin(10, 0.80) and P(X = 1) ≈ 0.
> dbinom(1, 10, 0.80)
[1] 4.096e-06
(c) Let X = number of waters purchased before the fifth soft drink is purchased. Then, X ∼ NB(5, 0.80) and P(X = 2) = 0.1966.
> dnbinom(2, 5, 0.80)
[1] 0.196608
(d)
> A <- pbinom(3, 10, 0.80)
> B <- 1 - pnbinom(6, 4, 0.80)
> c(A, B)
[1] 0.0008643584 0.0008643584
19. The hardness of a particular type of sheet metal sold by a local manufacturer has a normal distribution with a mean of 60 micra and a standard deviation of 2 micra.
(a) This type of sheet metal is said to conform to specification provided its hardness measure is between 57 and 65 micra. What percent of the manufacturer's sheet metal can be expected to fall within the specification?
(b) A building contractor agrees to purchase metal from the local metal manufacturer at a premium price provided four out of four randomly selected pieces of metal test between 57 and 65 micra. What is the probability the building contractor will purchase metal from the local manufacturer and pay a premium price?
(c) If an acceptable sheet of metal is one whose hardness is not more than c units away from the mean, find c such that 97% of the sheets are acceptable.
(d) Find the probability that at least 10 out of 20 sheets have a hardness greater than 60.
Solution:
(a) Given X ∼ N(60, 2), P(57 < X < 65) = 0.927. The percent of the manufacturer's sheet metal that can be expected to fall within the specification is 92.6983%.
> p <- pnorm(65, 60, 2) - pnorm(57, 60, 2)
> p
[1] 0.9269831
(b) Let Y = number of sheets that test between 57 and 65 micra. Then, Y ∼ Bin(4, 0.927) and P(Y = 4) = 0.7384, so there is a 0.7384 chance the contractor will purchase from the local manufacturer and pay a premium price.
> dbinom(4, 4, p)
[1] 0.7383926
(c) P(X < k) = 0.985 =⇒ k = 64.3402 =⇒ c = 4.3402
> k <- qnorm(0.985, 60, 2)
> INC <- k - 60
> c(k, INC)
[1] 64.340181  4.340181
(d) Let W = number of sheets that have hardness greater than 60. Then, W ∼ Bin(20, 0.5), and P(W ≥ 10) = 1 − P(W ≤ 9) = 0.5881.
> 1 - pbinom(9, 20, 0.5)
[1] 0.5880985
20. The weekly production of a banana plantation can be modeled with a normal random variable that has a mean of 5 tons and a standard deviation of 2 tons.
(a) Find the probability that, in at most 1 out of the 8 randomly chosen weeks, the production has been less than 3 tons.
(b) Find the probability that at least 3 weeks are needed to obtain a production greater than 10 tons.
Solution:
(a) Let X = number of weeks where production is less than 3 tons. Then, X ∼ Bin(8, 0.1587). Let W = production, W ∼ N(5, 2) and P(W ≤ 3) = 0.1587. So, P(X ≤ 1) = 0.6298.
> p <- pnorm(3, 5, 2)   # P(W < 3)
> pbinom(1, 8, p)
[1] 0.6298268
(b) Let Xn = number of weeks where production is less than 10 tons before the first week with production over 10 tons. Xn ∼ Geo(π = P(W ≥ 10) = 0.0062), and P(Xn ≥ 2) = 1 − P(Xn ≤ 1) = 0.9876.
> PI <- pnorm(10, 5, 2, lower = FALSE)   # P(W >= 10)
> PI
[1] 0.006209665
> 1 - pgeom(1, PI)
[1] 0.9876192
21. A bank has 50 deposit accounts with €25,000 each. The probability of having to close a deposit account and then refund the money in a given day is 0.01. If account closings are independent events, how much money must the bank have available to guarantee it can refund all closed accounts in a given day with probability greater than 0.95?
Solution:
Let X = number of accounts closed and refunded out of 50. Then, X ∼ Bin(50, 0.01). The minimum number of accounts closed where the probability is at least 0.95 occurs where P(X ≤ k) ≥ 0.95 =⇒ k = 2. Therefore, the bank must have on hand 2 × 25,000 = €50,000.
> k <- qbinom(0.95, 50, 0.01)
> k
[1] 2
> euros <- k*25000
> euros
[1] 50000
22. The mean number of calls a tow truck company receives during a day is 5 per hour. Find the probability that a tow truck is requested more than 4 times per hour in a given hour. What is the probability the company waits for less than 1 hour before the tow truck is requested 3 times?
Solution: Let X = number of times per hour the tow truck is requested. X ∼ Pois(λ = 5) and P(X > 4) = 1 − P(X ≤ 4) = 0.5595.
> 1 - ppois(4, 5)
[1] 0.5595067
Let W = the waiting time before the tow truck is requested 3 times. W ∼ Γ(3, 5) and P(W < 1) = 0.8753. The probability less than 1 hour goes by before the tow truck is requested 3 times is 0.8753.
> pgamma(1, 3, 5)
[1] 0.875348
23. In the printing section of a plastics company, a machine receives on average 6 buckets per minute to be painted and paints them. The machine has been out of service for 90 seconds due to a power failure.
(a) Find the probability that more than 8 buckets remain unpainted.
(b) Find the probability that the first bucket, after the electricity is restored, arrives before 10 seconds have passed.
Solution:
(a) Let X = number of buckets received per minute. Then, X ∼ Pois(λ = 6). If 6 buckets arrive in 60 seconds, 9 buckets are expected to arrive in 90 seconds. Let Y = number of buckets that arrive in 90 seconds. It follows then that Y ∼ Pois(λ = 9). Assume an employee is hand transporting the buckets at the same rate given in the problem. P(Y > 8) = 1 − P(Y ≤ 8) = 0.5443
> 1 - ppois(8, 9)
[1] 0.5443474
(b) Let T = time until the first bucket arrives. T ∼ Γ(1, 6) and P(T < 1/6) = 0.6321.
> pgamma(1/6, 1, 6)
[1] 0.6321206
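24. Give a general expression to calculate the quantiles of a Weibull random variable.
Solution: The jth quantile (0 ≤ j ≤ 1) is the value \(x_j\) such that
\[ F(x_j) = \int_{-\infty}^{x_j} f(x)\,dx = j. \]
For a Weibull,
\[ f(x) = \begin{cases} \alpha\beta^{-\alpha}x^{\alpha-1}e^{-(x/\beta)^{\alpha}} & x \ge 0 \\ 0 & x < 0, \end{cases} \]
so \(F(x) = 1 - e^{-(x/\beta)^{\alpha}}\) for x ≥ 0, and solving \(F(x_j) = j\) gives \(x_j = \beta\left[-\ln(1 - j)\right]^{1/\alpha}\).
> pweibull(12, 3, 25)
[1] 0.104696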
Since there is a 10.4696% chance of breakdown in the first 12 months and there are 50 cars, we can expect 5.2348 cars to break down during the guarantee period. Consequently, the expected price of the guarantee is $4187.8417.
> EBD <- 50 * pweibull(12, 3, 25)
> EBD   # E(Break Downs)
[1] 5.234802
> EGP   # E(Price of Guarantee)
[1] 4187.842
limitRange >
set.seed(500) rs > > + + > > >
set.seed(50) rs > > >
> abs(var(Volume) - VV)/VV*100
[1] 0.2798443
The simulated expected value of the volume is 0.3033% from the theoretical expected value of the volume, and the simulated variance of the volume is 0.2798% from the theoretical variance of the volume.
29. Let X be a random variable with probability density function
\[ f(x) = 3\left(\frac{1}{x}\right)^4, \quad x \ge 1. \]
(a) Find the cumulative density function.
(b) Fix the seed at 98 (set.seed(98)), and generate a random sample of size n = 100,000 from X's distribution. Compute the mean, variance, and coefficient of skewness for the random sample.
(c) Obtain the theoretical mean, variance, and coefficient of skewness of X.
(d) How close are the estimates in (b) to the theoretical values in (c)?
Solution:
(a)
\[ F_X(x) = \int_1^x 3\left(\frac{1}{t}\right)^4 dt = -t^{-3}\Big|_1^x = -x^{-3} + 1 \]
(b) To do the generation, find the relationship between a uniform and X.
\[
\begin{aligned}
F_X(x) = -x^{-3} + 1 &\overset{\text{set}}{=} u \\
-x^{-3} &= u - 1 \\
x^{-3} &= 1 - u \\
x &= (1 - u)^{-1/3} = \frac{1}{(1 - u)^{1/3}}
\end{aligned}
\]
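The inverse transform derived above can be applied as follows. This is a sketch with illustrative object names, not the manual's own code. The theoretical mean and variance used for comparison follow from E[X] = ∫ 3x⁻³ dx = 3/2 and E[X²] = ∫ 3x⁻² dx = 3 on [1, ∞). Note that E[X³] = ∫ 3x⁻¹ dx diverges, so the sample skewness should not be expected to settle near a finite theoretical value.

```r
set.seed(98)
n <- 100000
u <- runif(n)
x <- (1 - u)^(-1/3)    # inverse cdf applied to uniform draws

sample_stats <- c(mean = mean(x), variance = var(x))
theory       <- c(mean = 3/2, variance = 3 - (3/2)^2)
rbind(sample_stats, theory)
```

Because the fourth moment is also infinite, the sample variance converges slowly, and some gap between the two rows of the output is to be expected.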
set.seed(98) n 2.
(d) To do the generation, find the relationship between a uniform and X.
\[
\begin{aligned}
1 - x^{-\theta} &\overset{\text{set}}{=} u \\
x^{-\theta} &= 1 - u \\
x &= (1 - u)^{-1/\theta} = \frac{1}{(1 - u)^{1/\theta}}
\end{aligned}
\]
set.seed(42) n .75)? (d) Fix the seed at 13 (set.seed(13)), and generate 100,000 realizations of X. What are the mean and variance of the random sample? (e) Calculate the theoretical mean and variance of X. (f) How close are the estimates in (d) to the theoretical values in (e)? Solution: (a)
\[ \int_0^1 f(x)\,dx = \int_0^1 \frac{4}{3}\,x\left(2 - x^2\right) dx = \int_0^1 \left(\frac{8x}{3} - \frac{4x^3}{3}\right) dx = \left[\frac{4x^2}{3} - \frac{x^4}{3}\right]_0^1 = \frac{4}{3} - \frac{1}{3} - (0 - 0) = 1 \]
> f <- function(x) 4/3 * x * (2 - x^2)
> integrate(f, 0, 1)$value
[1] 1
(b)
\[ F_X(x) = \int_0^x \left(\frac{8t}{3} - \frac{4t^3}{3}\right) dt = \left[\frac{4t^2}{3} - \frac{t^4}{3}\right]_0^x = \frac{4x^2}{3} - \frac{x^4}{3}, \quad 0 \le x \le 1 \]
(c) P(X > 0.75) = 1 − F(0.75) = 0.3555
> 1 - (4*0.75^2/3 - 0.75^4/3)
[1] 0.3554688
> # or
> integrate(f, 0.75, 1)$value
[1] 0.3554688
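(d) Set the cdf equal to u from a uniform.
\[
\begin{aligned}
u &\overset{\text{set}}{=} F_X(x) \\
u &= \frac{4x^2}{3} - \frac{x^4}{3} \\
x^4 - 4x^2 + 3u &= 0
\end{aligned}
\]
Let y = x².
\[ y^2 - 4y + 3u = 0 \implies y = \frac{4 \pm \sqrt{16 - 4(1)(3u)}}{2} \implies y = 2 \pm \sqrt{4 - 3u} \]
Since x must fall between 0 and 1, only \(2 - \sqrt{4 - 3u}\) is a viable solution for y. This means
\[ x = \sqrt{2 - \sqrt{4 - 3u}} \]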
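The relationship x = √(2 − √(4 − 3u)) can be used to generate the requested sample. The sketch below uses illustrative object names and is not the manual's own code. The theoretical values used for comparison follow from the density f(x) = (4/3)x(2 − x²) on [0, 1]: E[X] = 28/45 and E[X²] = 4/9, so Var[X] = 4/9 − (28/45)² = 116/2025.

```r
set.seed(13)
n <- 100000
u <- runif(n)
x <- sqrt(2 - sqrt(4 - 3 * u))   # inverse cdf applied to uniform draws

sample_stats <- c(mean = mean(x), variance = var(x))
theory       <- c(mean = 28/45, variance = 116/2025)
rbind(sample_stats, theory)
```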
set.seed(13) n + + >
# or f > > > > >
n 1|θ = 5) = 1 − P (X ≤ 1|θ = 5) = 0 > Fx 1 - Fx(1) [1] 1.507017e-07 > > + + >
# Or f > > > > > >
set.seed(201) n
\[
\begin{aligned}
P(X > 0.13) &= 0.1 \\
P\left(\frac{X - \mu}{\sigma} > \frac{0.13 - 0.11}{\sigma}\right) &= 0.1 \\
P\left(Z > \frac{0.13 - 0.11}{\sigma}\right) &= 0.1 \implies z_{0.9} = \frac{0.13 - 0.11}{\sigma} \\
1.2816\,\sigma &= 0.02 \\
\sigma &= 0.0156
\end{aligned}
\]
∴ X ∼ N(0.11, 0.0156)
> mu <- 0.11
> sigma <- 0.02/qnorm(0.9)
> c(mu, sigma)
[1] 0.11000000 0.01560608
(b) P(0.10 < X < 0.13) = 0.6392
> Area <- pnorm(0.13, mu, sigma) - pnorm(0.10, mu, sigma)
> Area
[1] 0.6391658
(c) Let Y = number of usable conductor cables. Y ∼ Bin(n = 5, π = 0.6392). P(Y ≥ 3) = 1 − P(Y ≤ 2) = 0.7478
> 1 - pbinom(2, 5, Area)
[1] 0.7477729
35. A binomial, Bin(n, π), distribution can be approximated by a normal distribution, N(nπ, √(nπ(1 − π))), when nπ > 10 and n(1 − π) > 10. The Poisson distribution can also be approximated by a normal distribution N(λ, √λ) if λ > 10. Consider a sequence from 5 to 27 of a variable X (binomial or Poisson) and show that for n = 80, π = 0.2, and λ = 16 the aforementioned approximations are appropriate. The normal approximation to a discrete distribution can occasionally be improved by adding 0.5 to the normal random variable when finding the area to the left of said random variable. Specifically, create a table showing P(X ≤ x) for the range of X for the four distributions and a graph showing the density of the normal distributions with vertical lines showing P(X = x) for the binomial and Poisson distributions, respectively.
Solution:
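A sketch of the requested table follows; it covers the table only, not the graph. The column names are illustrative, and the 0.5 continuity correction described in the exercise is applied to both normal approximations.

```r
n <- 80; p <- 0.2; lambda <- 16
x <- 5:27

tab <- data.frame(
  x          = x,
  Binomial   = pbinom(x, n, p),                                   # exact Bin(80, 0.2)
  NormalBin  = pnorm(x + 0.5, n * p, sqrt(n * p * (1 - p))),      # N(16, sqrt(12.8))
  Poisson    = ppois(x, lambda),                                  # exact Pois(16)
  NormalPois = pnorm(x + 0.5, lambda, sqrt(lambda))               # N(16, 4)
)
round(tab, 4)

# Largest absolute discrepancy of each approximation over the range
c(binomial = max(abs(tab$Binomial - tab$NormalBin)),
  poisson  = max(abs(tab$Poisson - tab$NormalPois)))
```

Small maximum discrepancies in the last line would support the claim that both approximations are appropriate for these parameter values.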
> choose(4, 3)*choose(4, 1)/choose(8, 4)
[1] 0.2285714
38. Consider the function g(x) = (x − a)², where a is a constant and E[(X − a)²] is finite. Find a so that E[(X − a)²] is minimized.
Solution:
\[ h(a) = E[(X - a)^2] = E[X^2] - 2aE[X] + a^2. \]
Then \(h'(a) = -2E[X] + 2a \overset{\text{set}}{=} 0 \implies a = E[X]\). Since \(h''(a) = 2 > 0\), a = E[X] minimizes h(a).
39. Consider the random variable X ∼ Weib(α, β).
(a) Find the cdf for X.
(b) Use the definition of the hazard function to verify that for X ∼ Weib(α, β), the hazard function is given by \(h(t) = \dfrac{\alpha t^{\alpha-1}}{\beta^{\alpha}}\).
Solution:
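(a) For X ∼ Weib(α, β), \(f(x) = \alpha\beta^{-\alpha}x^{\alpha-1}e^{-(x/\beta)^{\alpha}}\) if x ≥ 0, and f(x) = 0 if x < 0.
\[ F_X(x) = \int_0^x \alpha\beta^{-\alpha}t^{\alpha-1}e^{-(t/\beta)^{\alpha}}\,dt = -e^{-(t/\beta)^{\alpha}}\Big|_0^x = 1 - e^{-(x/\beta)^{\alpha}} \quad \text{for } x \ge 0 \]
(b)
\[
\begin{aligned}
h(t) = \frac{f(t)}{1 - F(t)} &= \frac{\alpha\beta^{-\alpha}t^{\alpha-1}e^{-(t/\beta)^{\alpha}}}{1 - \left(1 - e^{-(t/\beta)^{\alpha}}\right)} \\
 &= \frac{\alpha\beta^{-\alpha}t^{\alpha-1}e^{-(t/\beta)^{\alpha}}}{e^{-(t/\beta)^{\alpha}}} \\
 &= \alpha\beta^{-\alpha}t^{\alpha-1} = \frac{\alpha t^{\alpha-1}}{\beta^{\alpha}}
\end{aligned}
\]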
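The Weibull hazard identity h(t) = αt^(α−1)/β^α can be checked numerically against the definition f(t)/(1 − F(t)) using R's dweibull() and pweibull(), whose shape and scale arguments correspond to α and β. The parameter values below are arbitrary illustrations.

```r
alpha <- 3; beta <- 25          # illustrative shape and scale
t <- seq(1, 40, by = 1)

h_def    <- dweibull(t, alpha, beta) / pweibull(t, alpha, beta, lower.tail = FALSE)
h_closed <- alpha * t^(alpha - 1) / beta^alpha

all.equal(h_def, h_closed)
```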
40. If X ∼ Bin(n, π), use the binomial expansion to find the mean and variance of X. To find the variance, use the second factorial moment E[X(X − 1)] and note that \(\frac{x}{x!} = \frac{1}{(x-1)!}\) when x > 0.
Solution: X ∼ Bin(n, π) =⇒ \(P(X = x) = \binom{n}{x}\pi^x(1 - \pi)^{n-x}\).
Mean:
\[
\begin{aligned}
E[X] &= \sum_{x=0}^{n} x\cdot\binom{n}{x}\pi^x(1 - \pi)^{n-x} = \sum_{x=0}^{n} \frac{x\cdot n!}{x!(n-x)!}\,\pi^x(1 - \pi)^{n-x} \\
 &\qquad\text{(the first term in the sum, } x = 0\text{, is zero)} \\
 &= \sum_{x=1}^{n} \frac{n!}{(x-1)!(n-x)!}\,\pi^x(1 - \pi)^{n-x} \\
 &\qquad\text{(let } k = x - 1\text{)} \\
 &= \sum_{k=0}^{n-1} \frac{n!}{k!(n-k-1)!}\,\pi^{k+1}(1 - \pi)^{n-k-1} \\
 &= n\pi\underbrace{\sum_{k=0}^{n-1} \frac{(n-1)!}{k!(n-1-k)!}\,\pi^{k}(1 - \pi)^{n-1-k}}_{\text{sum of all Bin}(n-1,\,\pi)\text{ probabilities}\,=\,1}
\end{aligned}
\]
∴ E[X] = nπ
Finding E[X(X − 1)]:
\[
\begin{aligned}
E[X(X-1)] &= \sum_{x=0}^{n} x(x-1)\cdot\frac{n!}{x!(n-x)!}\,\pi^x(1 - \pi)^{n-x} \\
 &\qquad\text{(the first two terms in the sum, } x = 0, 1\text{, are zero)} \\
 &= \sum_{x=2}^{n} \frac{n!}{(x-2)!(n-x)!}\,\pi^x(1 - \pi)^{n-x} \\
 &\qquad\text{(let } k = x - 2\text{, so } x = k + 2\text{)} \\
 &= \sum_{k=0}^{n-2} \frac{n!}{k!(n-2-k)!}\,\pi^{k+2}(1 - \pi)^{n-2-k} \\
 &= \pi^2 n(n-1)\underbrace{\sum_{k=0}^{n-2} \frac{(n-2)!}{k!(n-2-k)!}\,\pi^{k}(1 - \pi)^{n-2-k}}_{\text{sum of all Bin}(n-2,\,\pi)\text{ probabilities}\,=\,1}
\end{aligned}
\]
∴ E[X(X − 1)] = π²n(n − 1)
The second factorial moment is E[X(X − 1)] = E[X²] − E[X], so
\[
\begin{aligned}
\mathrm{Var}[X] &= E[X(X-1)] + E[X] - (E[X])^2 \\
 &= \pi^2 n(n-1) + n\pi - (n\pi)^2 \\
 &= n^2\pi^2 - n\pi^2 + n\pi - n^2\pi^2 \\
 &= n\pi(1 - \pi)
\end{aligned}
\]
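The binomial results E[X] = nπ, E[X(X − 1)] = π²n(n − 1), and Var[X] = nπ(1 − π) can be spot-checked by summing over the pmf directly. The values of n and p below are arbitrary illustrations.

```r
n <- 8; p <- 0.3                      # illustrative parameters
x  <- 0:n
px <- dbinom(x, n, p)

EX   <- sum(x * px)                   # mean
EXX1 <- sum(x * (x - 1) * px)         # second factorial moment
VX   <- EXX1 + EX - EX^2              # variance via the factorial moment

c(EX = EX, EXX1 = EXX1, VX = VX)
c(n * p, p^2 * n * (n - 1), n * p * (1 - p))
```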
41. The speed of a randomly chosen gas molecule in a certain volume of gas is a random variable, V, with probability density function
\[ f(v) = \sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2} v^2 e^{-\frac{Mv^2}{2RT}} \quad \text{for } v \ge 0 \]
where R is the gas constant (= 8.3145 J/(mol · K)), M is the molecular weight of the gas, and T is the absolute temperature measured in degrees Kelvin. (Hints: \(\int_0^{\infty} x^k e^{-x^2}\,dx = \frac{1}{2}\Gamma\left(\frac{k+1}{2}\right)\), \(\Gamma(\alpha + 1) = \alpha\Gamma(\alpha)\), and \(\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}\).)
(a) Derive a general expression for the average speed of a gas molecule.
(b) If 1 J = 1 kg · m²/s², what are the units for the answer in part (a)?
(c) Kinetic energy for a molecule is \(E_k = \frac{Mv^2}{2}\). Derive a general expression for the average kinetic energy of a molecule.
(d) The weight of hydrogen is 1.008 g/mol. Note that there are 6.0221415 × 10²³ molecules in 1 mole. Find the average speed of a hydrogen molecule at 300 K using the result from part (a).
(e) Use numerical integration to verify the result from part (d).
(f) Show the probability density functions for the speeds of hydrogen, helium, and oxygen on a single graph. The molecular weights for these elements are 1.008 g/mol, 4.003 g/mol, and 16.00 g/mol, respectively.
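Solution:
(a)
\[ E[V] = \int_0^{\infty} v\cdot\sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2} v^2 e^{-\frac{Mv^2}{2RT}}\,dv \]
Let \(y^2 = \frac{Mv^2}{2RT} \Rightarrow v = y\sqrt{\frac{2RT}{M}}\) and \(dv = \sqrt{\frac{2RT}{M}}\,dy\).
\[
\begin{aligned}
E[V] &= \int_0^{\infty} \sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2} y^3\left(\frac{2RT}{M}\right)^{3/2} e^{-y^2}\sqrt{\frac{2RT}{M}}\,dy \\
 &= \sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2}\left(\frac{2RT}{M}\right)^{3/2}\left(\frac{2RT}{M}\right)^{1/2}\int_0^{\infty} y^3 e^{-y^2}\,dy \\
 &= \sqrt{\frac{2}{\pi}}\left(\frac{RT}{M}\right)^{-3/2}\left(\frac{2RT}{M}\right)^{2}\cdot\frac{1}{2}\Gamma\left(\frac{3+1}{2}\right) \\
 &= 4\sqrt{\frac{2}{\pi}}\left(\frac{RT}{M}\right)^{1/2}\cdot\frac{1}{2}\Gamma(2) \\
 &= 2\sqrt{\frac{2RT}{\pi M}}
\end{aligned}
\]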
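(b) Finding units for \(2\sqrt{\frac{2RT}{\pi M}}\): R is in J/(mol · K); T is in K; M is in kg/mol; and a J is kg · m²/s².
Units ONLY of \(\sqrt{\frac{2RT}{\pi M}}\) are
\[ \sqrt{\frac{\frac{\text{J}}{\text{mol}\cdot\text{K}}\cdot\text{K}}{\frac{\text{kg}}{\text{mol}}}} = \sqrt{\frac{\frac{\text{kg}\cdot\text{m}^2}{\text{s}^2}}{\text{kg}}} = \sqrt{\frac{\text{m}^2}{\text{s}^2}} = \frac{\text{m}}{\text{s}} \]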
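(c)
\[
\begin{aligned}
E[E_k] &= \int_0^{\infty} \frac{Mv^2}{2}\cdot\sqrt{\frac{2}{\pi}}\left(\frac{M}{RT}\right)^{3/2} v^2 e^{-\frac{Mv^2}{2RT}}\,dv \\
 &= \frac{M}{2}\left(\frac{M}{RT}\right)^{3/2}\sqrt{\frac{2}{\pi}}\int_0^{\infty} v^4 e^{-\frac{Mv^2}{2RT}}\,dv
\end{aligned}
\]
Let \(y^2 = \frac{Mv^2}{2RT} \Rightarrow v = y\sqrt{\frac{2RT}{M}}\) and \(dv = \sqrt{\frac{2RT}{M}}\,dy\).
\[
\begin{aligned}
E[E_k] &= \frac{M}{2}\left(\frac{M}{RT}\right)^{3/2}\sqrt{\frac{2}{\pi}}\int_0^{\infty} \left(\sqrt{\frac{2RT}{M}}\right)^{4} y^4 e^{-y^2}\sqrt{\frac{2RT}{M}}\,dy \\
 &= \frac{4RT}{\sqrt{\pi}}\cdot\frac{1}{2}\Gamma\left(\frac{4+1}{2}\right) \\
 &= \frac{2RT}{\sqrt{\pi}}\cdot\frac{3}{2}\cdot\frac{\sqrt{\pi}}{2} \\
 &= \frac{3RT}{2}
\end{aligned}
\]
(d)
\[
\begin{aligned}
E[V] &= 2\sqrt{\frac{2RT}{\pi M}}
      = 2\sqrt{\frac{2\left(8.3145\,\frac{\text{J}}{\text{mol}\cdot\text{K}}\right)\cdot 300\,\text{K}}{\pi\cdot 1.008\,\frac{\text{g}}{\text{mol}}}} \\
     &= 2\sqrt{\frac{2(8.3145)\cdot 300}{\pi\cdot 1.008}\cdot\frac{\text{J}}{\text{g}}}
      = 2\sqrt{\frac{2(8.3145)\cdot 300}{\pi\cdot 1.008}\cdot\frac{1000\,\text{g}\cdot\text{m}^2}{\text{s}^2\cdot\text{g}}} \\
     &= 2510.259\ \frac{\text{m}}{\text{s}}
\end{aligned}
\]
(e) Numerical integration for E[V] gives an answer in (J/g)^{1/2}. To convert to m/s, multiply by √1000.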
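Parts (e) and (f) can be sketched as follows. The function name fv, the object names, and the plotting ranges are illustrative assumptions rather than the manual's own code. Following the remark in (e), M is entered in g/mol, so speeds come out in (J/g)^(1/2) and are multiplied by √1000 to obtain m/s.

```r
R    <- 8.3145   # gas constant, J/(mol K)
Temp <- 300      # absolute temperature, K

# Speed density from the problem statement; M is the molecular weight in g/mol
fv <- function(v, M) {
  sqrt(2/pi) * (M/(R*Temp))^(3/2) * v^2 * exp(-M*v^2/(2*R*Temp))
}

# (e) numerical mean speed for hydrogen, converted from (J/g)^(1/2) to m/s
EV <- integrate(function(v) v * fv(v, M = 1.008), 0, Inf, rel.tol = 1e-8)$value
EV_ms      <- EV * sqrt(1000)
EV_formula <- 2 * sqrt(2*R*Temp/(pi*1.008)) * sqrt(1000)   # closed form from (a)
c(numerical = EV_ms, closed_form = EV_formula)

# (f) densities on one graph; the heaviest gas has the tallest peak, so draw it first
curve(fv(x, M = 16.00), from = 0, to = 200, lty = 1,
      xlab = "Speed in (J/g)^(1/2)", ylab = "f(v)")
curve(fv(x, M = 4.003), add = TRUE, lty = 2)
curve(fv(x, M = 1.008), add = TRUE, lty = 3)
legend("topright", legend = c("Oxygen", "Helium", "Hydrogen"), lty = 1:3)
```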
M + + + > > + + + + + + + +
M >
2 λ2
− µ2 =
2 λ2
set.seed(3) X1 + + + +
laplace VY VY [1] 8.070168 > abs(VY - 8)/8*100 [1] 0.8770954
43. A tombola is a raffle in which prizes are assigned to winning tickets. In a particular tombola, only 2 tickets out of n win a prize. After the two winning tickets are sold, a new tombola is started. The tickets are sold consecutively, and the prize is immediately announced when one person wins. Two friends have decided to play tombola in the following way: One of them buys the first ticket on sale, and the other one buys the first ticket after the first prize has been announced. Derive the probability that each of them wins a prize. If there are m tombolas during the night in which the two friends participate, what is the probability that each of them wins more than one prize?
Solution: Let A = friend one buys the first ticket and wins, then P(A) = 2/n. Let B = friend two buys the first ticket after the first prize is awarded and wins. Let B_i = friend two buys the first ticket after the first prize is awarded and wins with the ith ticket where i > 1. Note that \(P(B) = \sum_{i=2}^{n} P(B_i)\). Now
\[
\begin{aligned}
P(B_2) &= \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_3) &= \frac{n-2}{n}\cdot\frac{2}{n-1}\cdot\frac{1}{n-2} = \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_4) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdot\frac{2}{n-2}\cdot\frac{1}{n-3} = \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_5) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdot\frac{n-4}{n-2}\cdot\frac{2}{n-3}\cdot\frac{1}{n-4} = \frac{2}{n}\cdot\frac{1}{n-1} \\
P(B_6) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdot\frac{n-4}{n-2}\cdot\frac{n-5}{n-3}\cdot\frac{2}{n-4}\cdot\frac{1}{n-5} = \frac{2}{n}\cdot\frac{1}{n-1} \\
 &\;\;\vdots \\
P(B_i) &= \frac{n-2}{n}\cdot\frac{n-3}{n-1}\cdots\frac{n-(i-1)}{n-(i-3)}\cdot\frac{2}{n-(i-2)}\cdot\frac{1}{n-(i-1)} = \frac{2}{n}\cdot\frac{1}{n-1}.
\end{aligned}
\]
Thus
\[ P(B) = \sum_{i=2}^{n} P(B_i) = \sum_{i=2}^{n} \frac{2}{n}\cdot\frac{1}{n-1} = (n-1)\cdot\frac{2}{n}\cdot\frac{1}{n-1} = \frac{2}{n}. \]
∴ P(B) = P(A) = 2/n
Let X = the number of winning tickets purchased by friend one. X ∼ Bin(m, π = 2/n).
\[
\begin{aligned}
P(X > 1) &= 1 - P(X \le 1) = 1 - P(X = 0) - P(X = 1) \\
 &= 1 - \binom{m}{0}\left(\frac{2}{n}\right)^{0}\left(\frac{n-2}{n}\right)^{m} - \binom{m}{1}\left(\frac{2}{n}\right)^{1}\left(\frac{n-2}{n}\right)^{m-1} \\
 &= 1 - \left(\frac{n-2}{n}\right)^{m} - m\,\frac{2}{n}\left(\frac{n-2}{n}\right)^{m-1} \\
 &= 1 - \frac{(n-2)^{m-1}(n + 2m - 2)}{n^{m}}.
\end{aligned}
\]
Since P(A) = P(B), the probability that each wins more than one prize is
\[ 2\cdot\left[1 - \frac{(n-2)^{m-1}(n + 2m - 2)}{n^{m}}\right]. \]
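The closed form for P(X > 1) in the tombola solution can be compared with the binomial cdf. In this sketch the helper name p_more_than_one and the values n = 50 and m = 10 are hypothetical choices for illustration only.

```r
# Closed form for P(X > 1) when X ~ Bin(m, 2/n)
p_more_than_one <- function(n, m) {
  1 - (n - 2)^(m - 1) * (n + 2 * m - 2) / n^m
}

n <- 50; m <- 10    # hypothetical tombola size and number of tombolas
closed   <- p_more_than_one(n, m)
binomial <- 1 - pbinom(1, m, 2/n)
c(closed_form = closed, binomial = binomial)
```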
44. Consider the World Cup Soccer data stored in the data frame SOCCER. The observed and expected number of goals for a 90-minute game were computed in the “Poisson: World Cup Soccer Example” in this chapter. To verify that the Poisson rate λ is constant, compute the observed and expected number of goals with the time intervals 45, 15, 10, 5, and 1 minute(s). Compute the means and variances for both the observed and expected counts in each time interval. Based on the results, is the probability of exactly one outcome in a sufficiently short interval proportional to the length of the interval? Solution: There is very good agreement between the observed number of goals and the expected number of goals using a Poisson distribution regardless of the time period. Specifically, the probability of a goal in a given period is proportional to the length of the period. The function bigf() is written to answer the question. > + + + + + + + + + + + + + + + + + +
bigf 0 and ρ = −1 when a < 0. Solution:
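Because $Y = aX + b$, $\sigma_Y = |a|\sigma_X$, so
\[
\begin{aligned}
\rho_{X,Y} &= \frac{\operatorname{Cov}[X,Y]}{\sigma_X\sigma_Y} = \frac{E[XY]-E[X]E[Y]}{\sigma_X\sigma_Y}
= \frac{E[X(aX+b)] - E[X]\,(aE[X]+b)}{\sigma_X\,|a|\,\sigma_X}\\
&= \frac{aE[X^2]-aE[X]^2}{|a|\sigma_X^2} = \frac{a\operatorname{Cov}[X,X]}{|a|\sigma_X^2} = \frac{a\sigma_X^2}{|a|\sigma_X^2} = \frac{a}{|a|}.
\end{aligned}
\]
If $a > 0$, then $\rho_{X,Y} = \frac{a}{|a|} = 1$. If $a < 0$, then $\rho_{X,Y} = \frac{a}{|a|} = -1$.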
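The result can be illustrated numerically. The following is a minimal sketch, assuming an arbitrary simulated sample and the illustrative constants a = ±3 and b = 2, none of which come from the exercise:

```r
# Illustrative check that a linear transformation Y = aX + b gives rho = sign(a).
# The sample x and the constants 3, -3, and 2 are assumptions for this sketch.
set.seed(1)
x <- rnorm(100)
rho_pos <- cor(x, 3 * x + 2)    # a > 0
rho_neg <- cor(x, -3 * x + 2)   # a < 0
c(rho_pos, rho_neg)
```

Any nonzero slope and any intercept should give the same two correlations, up to floating-point error.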
14. Given the joint density function
\[
f(x, y) = 6x, \qquad 0 < x < y < 1,
\]
find $E[Y \mid X]$, the regression line that results from regressing Y on X.

Solution:
\[
E[Y \mid X] = \int_{-\infty}^{\infty} y\, f(y \mid x)\, dy, \qquad f(y \mid x) = \frac{f(x, y)}{f_X(x)}.
\]
Chapter 5:
Multivariate Probability Distributions
181
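\[
f_X(x) = \int_x^1 f(x, y)\, dy = \int_x^1 6x\, dy = 6xy\Big|_x^1 = 6x - 6x^2 = 6x(1-x), \quad 0 < x < 1.
\]
\[
f(y \mid x) = \frac{6x}{6x(1-x)} = \frac{1}{1-x}, \quad x < y < 1,\ 0 < x < 1.
\]
This means
\[
E[Y \mid X] = \int_x^1 y\cdot\frac{1}{1-x}\, dy = \frac{y^2}{2(1-x)}\Big|_x^1 = \frac{1-x^2}{2(1-x)} = \frac{1+x}{2} = \frac{1}{2} + \frac{x}{2}.
\]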
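15. A poker hand (5 cards) is dealt from a single deck of well-shuffled cards. If the random variables X and Y represent the number of aces and the number of kings in a hand, respectively,
(a) Write the joint distribution $p_{X,Y}(x, y)$.
(b) What is the marginal distribution of X, $p_X(x)$?
(c) What is the marginal distribution of Y, $p_Y(y)$?

Hint: $\displaystyle\sum_{x=0}^{\infty}\binom{a}{x}\binom{b}{n-x} = \binom{a+b}{n}$.

Solution:
(a)
\[
p_{X,Y}(x, y) = \frac{\binom{4}{x}\binom{4}{y}\binom{44}{5-x-y}}{\binom{52}{5}}
\]
for $x = 0, 1, 2, 3, 4$; $y = 0, 1, 2, 3, 4$; $0 \le x + y \le 5$.
(b)
\[
p_X(x) = \sum_y p_{X,Y}(x, y) = \frac{\binom{4}{x}\binom{48}{5-x}}{\binom{52}{5}}\sum_y \frac{\binom{4}{y}\binom{44}{5-x-y}}{\binom{48}{5-x}} = \frac{\binom{4}{x}\binom{48}{5-x}}{\binom{52}{5}}
\]
(c)
\[
p_Y(y) = \sum_x p_{X,Y}(x, y) = \frac{\binom{4}{y}\binom{48}{5-y}}{\binom{52}{5}}\sum_x \frac{\binom{4}{x}\binom{44}{5-x-y}}{\binom{48}{5-y}} = \frac{\binom{4}{y}\binom{48}{5-y}}{\binom{52}{5}}
\]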
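The marginal distribution in (b) can be checked numerically by tabulating the joint pmf and summing over y. This is only a sketch; the names pxy, joint, pX, and pX_formula are illustrative and do not appear in the original solution.

```r
# Joint pmf of the number of aces (x) and kings (y) in a 5-card hand.
# choose() returns 0 for a negative lower argument, so impossible
# combinations (x + y > 5) automatically get probability 0.
pxy <- function(x, y) {
  choose(4, x) * choose(4, y) * choose(44, 5 - x - y) / choose(52, 5)
}
joint <- outer(0:4, 0:4, pxy)   # rows: x = 0..4, columns: y = 0..4
pX <- rowSums(joint)            # marginal of X obtained by summing over y
pX_formula <- choose(4, 0:4) * choose(48, 5 - 0:4) / choose(52, 5)
all.equal(pX, pX_formula)       # compares the two versions of p_X(x)
sum(joint)                      # total probability of the joint pmf
```

By symmetry, colSums(joint) gives the marginal distribution of Y.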
16. If $f_{X,Y}(x, y) = 5x - y^2$ in the region bounded by $y = 0$, $x = 0$, and $y = 2 - 2x$, find the density function for the marginal distribution of X, for $0 < x < 1$.

Solution:
\[
\begin{aligned}
f_X(x) &= \int_0^{2-2x} (5x - y^2)\, dy\\
&= \left[5xy - \frac{y^3}{3}\right]_0^{2-2x}\\
&= 5x(2-2x) - \frac{(2-2x)^3}{3}\\
&= 10x - 10x^2 - \frac{8}{3}\left(1 - 3x + 3x^2 - x^3\right)\\
&= \frac{8}{3}x^3 + 18x - 18x^2 - \frac{8}{3} \quad \text{for } 0 < x < 1.
\end{aligned}
\]
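17. If $f(x, y) = e^{-(x+y)}$, $x > 0$, and $y > 0$, find $P\left(X + 3 > Y \mid X > \frac{1}{3}\right)$.

Solution:
\[
P\left(X + 3 > Y \,\Big|\, X > \tfrac{1}{3}\right) = \frac{P\left(X + 3 > Y,\ X > \tfrac{1}{3}\right)}{P\left(X > \tfrac{1}{3}\right)}
\]
\[
\begin{aligned}
P\left(X + 3 > Y,\ X > \tfrac{1}{3}\right) &= \int_{1/3}^{\infty}\int_0^{x+3} e^{-(x+y)}\, dy\, dx\\
&= \int_{1/3}^{\infty} \left[-e^{-(x+y)}\right]_0^{x+3} dx\\
&= \int_{1/3}^{\infty} \left(-e^{-(2x+3)} + e^{-x}\right) dx\\
&= \left[\frac{e^{-(2x+3)}}{2} - e^{-x}\right]_{1/3}^{\infty}\\
&= -\frac{e^{-11/3}}{2} + e^{-1/3}
\end{aligned}
\]
\[
f_X(x) = \int_0^{\infty} e^{-(x+y)}\, dy = e^{-x}
\]
\[
P\left(X > \tfrac{1}{3}\right) = \int_{1/3}^{\infty} e^{-x}\, dx = -e^{-x}\Big|_{1/3}^{\infty} = e^{-1/3}
\]
\[
P\left(X + 3 > Y \,\Big|\, X > \tfrac{1}{3}\right) = \frac{-\frac{e^{-11/3}}{2} + e^{-1/3}}{e^{-1/3}} = 1 - \frac{1}{2}e^{-10/3} = 0.9822
\]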
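The closed-form answer can be compared with a numerical evaluation of the same iterated integrals. The sketch below uses base R's integrate(); the helper name inner and the objects num and den are illustrative and not part of the original solution.

```r
# Numerator: P(X + 3 > Y, X > 1/3) as an iterated integral.
# inner() integrates over y for each x and is vectorized with sapply()
# because integrate() passes a vector of x values to its integrand.
inner <- function(x) {
  sapply(x, function(xi) {
    integrate(function(y) exp(-(xi + y)), lower = 0, upper = xi + 3)$value
  })
}
num <- integrate(inner, lower = 1/3, upper = Inf)$value
# Denominator: P(X > 1/3) from the marginal density exp(-x).
den <- integrate(function(x) exp(-x), lower = 1/3, upper = Inf)$value
c(numeric = num / den, closed_form = 1 - exp(-10/3) / 2)
```

The two entries should agree to within the tolerance of the numerical integration.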
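18. If $f(x, y) = 1$, $0 < x < 1$, $0 < y < 1$, what is $P\left(Y - X > \frac{1}{2} \mid X + Y > \frac{1}{2}\right)$?

Solution:
\[
P\left(Y - X > \tfrac{1}{2} \,\Big|\, X + Y > \tfrac{1}{2}\right) = \frac{P\left(Y - X > \tfrac{1}{2},\ X + Y > \tfrac{1}{2}\right)}{P\left(X + Y > \tfrac{1}{2}\right)} = \frac{P\left(Y > X + \tfrac{1}{2}\right)}{P\left(X > \tfrac{1}{2} - Y\right)},
\]
because on the unit square $Y > X + \frac{1}{2}$ already implies $X + Y > \frac{1}{2}$.
\[
f_X(x) = \int_0^1 f(x, y)\, dy = \int_0^1 1\, dy = y\Big|_0^1 = 1 \quad \text{for } 0 < x < 1
\]
\[
\begin{aligned}
P\left(X > \tfrac{1}{2} - Y\right) &= 1 - P\left(X \le \tfrac{1}{2} - Y\right)\\
&= 1 - \int_0^{1/2}\int_0^{\frac{1}{2}-y} 1\, dx\, dy\\
&= 1 - \int_0^{1/2} x\Big|_0^{\frac{1}{2}-y}\, dy\\
&= 1 - \int_0^{1/2} \left(\tfrac{1}{2} - y\right) dy\\
&= 1 - \left[\frac{y}{2} - \frac{y^2}{2}\right]_0^{1/2}\\
&= 1 - \left(\frac{1}{4} - \frac{1}{8}\right) = 1 - \frac{1}{8} = \frac{7}{8}
\end{aligned}
\]
\[
\begin{aligned}
P\left(Y > X + \tfrac{1}{2}\right) &= \int_{1/2}^{1}\int_0^{y-\frac{1}{2}} 1\, dx\, dy\\
&= \int_{1/2}^{1} x\Big|_0^{y-\frac{1}{2}}\, dy\\
&= \int_{1/2}^{1} \left(y - \tfrac{1}{2}\right) dy\\
&= \left[\frac{y^2}{2} - \frac{y}{2}\right]_{1/2}^{1}\\
&= \left(\frac{1}{2} - \frac{1}{2}\right) - \left(\frac{1}{8} - \frac{1}{4}\right) = \frac{1}{8}
\end{aligned}
\]
\[
P\left(Y - X > \tfrac{1}{2} \,\Big|\, X + Y > \tfrac{1}{2}\right) = \frac{1/8}{7/8} = \frac{1}{7}
\]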
19. If $f(x, y) = k(y - 2x)$ is a joint density function over $0 < x < 1$, $0 < y < 1$, and $y > x^2$, then what is the value of the constant k?

Solution: k must have a value so that $\int\!\!\int f(x, y)\, dx\, dy = 1$ over the region.
\[
\begin{aligned}
\int\!\!\int f(x, y)\, dx\, dy &= k\int_0^1\int_{x^2}^{1} (y - 2x)\, dy\, dx\\
&= k\int_0^1 \left[\frac{y^2}{2} - 2xy\right]_{x^2}^{1} dx\\
&= k\int_0^1 \left[\left(\frac{1}{2} - 2x\right) - \left(\frac{x^4}{2} - 2x^3\right)\right] dx\\
&= k\left[\frac{x}{2} - x^2 - \frac{x^5}{10} + \frac{x^4}{2}\right]_0^1\\
&= k\left(\frac{1}{2} - 1 - \frac{1}{10} + \frac{1}{2}\right)\\
&= k\left(-\frac{1}{10}\right) = 1
\end{aligned}
\]
\[
\Longrightarrow k = -10
\]
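20. Let X and Y have the joint density function
\[
f(x, y) = \begin{cases} \frac{4}{3}x + \frac{2}{3}y & \text{for } 0 < x < 1 \text{ and } 0 < y < 1,\\ 0 & \text{otherwise.} \end{cases}
\]
Find $P(2X < 1 \mid X + Y < 1)$.

Solution:
\[
\begin{aligned}
P(2X < 1,\ Y < 1 - X) &= \int_0^{1/2}\int_0^{1-x} \left(\frac{4}{3}x + \frac{2}{3}y\right) dy\, dx\\
&= \int_0^{1/2} \left[\frac{4}{3}xy + \frac{1}{3}y^2\right]_0^{1-x} dx\\
&= \int_0^{1/2} \left[\frac{4}{3}(x - x^2) + \frac{1}{3}(1 - 2x + x^2)\right] dx\\
&= \int_0^{1/2} \left(-x^2 + \frac{2}{3}x + \frac{1}{3}\right) dx\\
&= \left[-\frac{x^3}{3} + \frac{x^2}{3} + \frac{x}{3}\right]_0^{1/2}\\
&= -\frac{1}{24} + \frac{1}{12} + \frac{1}{6} = \frac{5}{24}
\end{aligned}
\]
\[
\begin{aligned}
P(Y < 1 - X) &= \int_0^{1}\int_0^{1-x} \left(\frac{4}{3}x + \frac{2}{3}y\right) dy\, dx\\
&= \int_0^{1} \left[\frac{4}{3}xy + \frac{1}{3}y^2\right]_0^{1-x} dx\\
&= \int_0^{1} \left(-x^2 + \frac{2}{3}x + \frac{1}{3}\right) dx\\
&= \left[-\frac{x^3}{3} + \frac{x^2}{3} + \frac{x}{3}\right]_0^{1}\\
&= -\frac{1}{3} + \frac{1}{3} + \frac{1}{3} = \frac{1}{3}
\end{aligned}
\]
So
\[
P(2X < 1 \mid X + Y < 1) = \frac{P(2X < 1,\ Y < 1 - X)}{P(Y < 1 - X)} = \frac{5/24}{1/3} = \frac{5}{8}.
\]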
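21. Let X and Y have the joint density function
\[
f(x, y) = \begin{cases} 6(x - y)^2 & \text{for } 0 < x < 1 \text{ and } 0 < y < 1,\\ 0 & \text{otherwise.} \end{cases}
\]
(a) Find $P\left(X < \frac{1}{2} \mid Y < \frac{1}{4}\right)$.
(b) Find $P\left(X < \frac{1}{2} \mid Y = \frac{1}{4}\right)$.

Solution:
(a) The inner integral is $\int_0^{1/4} 6(x-y)^2\, dy = \left[-2(x-y)^3\right]_0^{1/4} = 2\left[x^3 - \left(x - \frac{1}{4}\right)^3\right]$, and the common constant cancels in the ratio:
\[
\begin{aligned}
P\left(X < \tfrac{1}{2} \,\Big|\, Y < \tfrac{1}{4}\right) &= \frac{P\left(X < \tfrac{1}{2},\ Y < \tfrac{1}{4}\right)}{P\left(Y < \tfrac{1}{4}\right)}
= \frac{\int_0^{1/2}\int_0^{1/4} 6(x-y)^2\, dy\, dx}{\int_0^{1}\int_0^{1/4} 6(x-y)^2\, dy\, dx}\\
&= \frac{\int_0^{1/2} \left[x^3 - \left(x - \tfrac{1}{4}\right)^3\right] dx}{\int_0^{1} \left[x^3 - \left(x - \tfrac{1}{4}\right)^3\right] dx}
= \frac{\left[x^4 - \left(x - \tfrac{1}{4}\right)^4\right]_0^{1/2}}{\left[x^4 - \left(x - \tfrac{1}{4}\right)^4\right]_0^{1}}\\
&= \frac{\left(\tfrac{1}{2}\right)^4 - \left(\tfrac{1}{2} - \tfrac{1}{4}\right)^4 - \left[0^4 - \left(0 - \tfrac{1}{4}\right)^4\right]}{1^4 - \left(1 - \tfrac{1}{4}\right)^4 - \left[0^4 - \left(0 - \tfrac{1}{4}\right)^4\right]}\\
&= \frac{\tfrac{1}{16} - \tfrac{1}{256} + \tfrac{1}{256}}{1 - \tfrac{81}{256} + \tfrac{1}{256}} = \frac{1/16}{176/256} = \frac{1}{11}
\end{aligned}
\]
(b)
\[
\begin{aligned}
P\left(X < \tfrac{1}{2} \,\Big|\, Y = \tfrac{1}{4}\right) &= \frac{\int_0^{1/2} 6\left(x - \tfrac{1}{4}\right)^2 dx}{\int_0^{1} 6\left(x - \tfrac{1}{4}\right)^2 dx}
= \frac{\left[2\left(x - \tfrac{1}{4}\right)^3\right]_0^{1/2}}{\left[2\left(x - \tfrac{1}{4}\right)^3\right]_0^{1}}\\
&= \frac{2\left[\left(\tfrac{1}{2} - \tfrac{1}{4}\right)^3 - \left(0 - \tfrac{1}{4}\right)^3\right]}{2\left[\left(1 - \tfrac{1}{4}\right)^3 - \left(0 - \tfrac{1}{4}\right)^3\right]}
= \frac{2\left(\tfrac{1}{64} + \tfrac{1}{64}\right)}{2\left(\tfrac{27}{64} + \tfrac{1}{64}\right)} = \frac{2/64}{28/64} = \frac{1}{14}
\end{aligned}
\]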
22. Let X and Y denote the weight (in kilograms) and height (in centimeters), respectively, of 20-year-old American males. Assume that X and Y have a bivariate normal distribution with parameters $\mu_X = 82$, $\sigma_X = 9$, $\mu_Y = 190$, $\sigma_Y = 10$, and $\rho = 0.8$. Find
(a) $E[Y \mid X = 75]$,
(b) $E[Y \mid X = 90]$,
(c) $Var[Y \mid X = 75]$,
(d) $Var[Y \mid X = 90]$,
(e) $P(Y \ge 190 \mid X = 75)$, and
(f) $P(185 \le Y \le 195 \mid X = 90)$.

Solution: Recall that $E[Y \mid X = x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and $Var[Y \mid X = x] = \sigma^2_{Y|x} = \sigma_Y^2(1 - \rho^2)$.

(a) $E[Y \mid X = 75] = 190 + 0.8\cdot\frac{10}{9}(75 - 82) = 183.7778$
> EYgX75 <- 190 + 0.8*(10/9)*(75 - 82)
> EYgX75
[1] 183.7778
(b) $E[Y \mid X = 90] = 190 + 0.8\cdot\frac{10}{9}(90 - 82) = 197.1111$
> EYgX90 <- 190 + 0.8*(10/9)*(90 - 82)
> EYgX90
[1] 197.1111
(c) $Var[Y \mid X = 75] = 10^2(1 - 0.8^2) = 36$
> VYgX75 <- 10^2*(1 - 0.8^2)
> VYgX75
[1] 36
(d) $Var[Y \mid X = 90] = 10^2(1 - 0.8^2) = 36$
> VYgX90 <- 10^2*(1 - 0.8^2)
> VYgX90
[1] 36
(e) $P(Y \ge 190 \mid X = 75) = 0.1499$ because $Y|_{x=75} \sim N(183.7778, \sqrt{36})$.
> 1 - pnorm(190, EYgX75, sqrt(VYgX75))
[1] 0.1498593
(f) $P(185 \le Y \le 195 \mid X = 90) = 0.3407$
> pnorm(195, EYgX90, sqrt(VYgX90)) - pnorm(185, EYgX90, sqrt(VYgX90))
[1] 0.340706
23. Let X and Y denote the heart rate (in beats per minute) and average power output (in watts) for a 10-minute cycling time trial performed by a professional cyclist. Assume that X and Y have a bivariate normal distribution with parameters $\mu_X = 180$, $\sigma_X = 10$, $\mu_Y = 400$, $\sigma_Y = 50$, and $\rho = 0.9$. Find
(a) $E[Y \mid X = 170]$,
(b) $E[Y \mid X = 200]$,
(c) $Var[Y \mid X = 170]$,
(d) $Var[Y \mid X = 200]$,
(e) $P(Y \le 380 \mid X = 170)$, and
(f) $P(Y \ge 450 \mid X = 200)$.

Solution: Recall that $E[Y \mid X = x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and $Var[Y \mid X = x] = \sigma^2_{Y|x} = \sigma_Y^2(1 - \rho^2)$.

(a) $E[Y \mid X = 170] = 400 + 0.9\cdot\frac{50}{10}(170 - 180) = 355$
> EYgX170 <- 400 + 0.9*(50/10)*(170 - 180)
> EYgX170
[1] 355
(b) $E[Y \mid X = 200] = 400 + 0.9\cdot\frac{50}{10}(200 - 180) = 490$
> EYgX200 <- 400 + 0.9*(50/10)*(200 - 180)
> EYgX200
[1] 490
(c) $Var[Y \mid X = 170] = 50^2(1 - 0.9^2) = 475$
> VYgX170 <- 50^2*(1 - 0.9^2)
> VYgX170
[1] 475
(d) $Var[Y \mid X = 200] = 50^2(1 - 0.9^2) = 475$
> VYgX200 <- 50^2*(1 - 0.9^2)
> VYgX200
[1] 475
(e) $P(Y \le 380 \mid X = 170) = 0.8743$
> pnorm(380, EYgX170, sqrt(VYgX170))
[1] 0.8743254
(f) $P(Y \ge 450 \mid X = 200) = 0.9668$
> 1 - pnorm(450, EYgX200, sqrt(VYgX200))
[1] 0.9667713
24. A certain group of college students takes both the Scholastic Aptitude Test (SAT) and an intelligence quotient (IQ) test. Let X and Y denote the students' scores on the SAT and IQ tests, respectively. Assume that X and Y have a bivariate normal distribution with parameters $\mu_X = 980$, $\sigma_X = 126$, $\mu_Y = 117$, $\sigma_Y = 7.2$, and $\rho = 0.58$. Find
(a) $E[Y \mid X = 1350]$,
(b) $E[Y \mid X = 700]$,
(c) $Var[Y \mid X = 700]$,
(d) $P(Y \le 120 \mid X = 1350)$, and
(e) $P(Y \ge 100 \mid X = 700)$.

Solution: Recall that $E[Y \mid X = x] = \mu_Y + \rho\frac{\sigma_Y}{\sigma_X}(x - \mu_X)$ and $Var[Y \mid X = x] = \sigma^2_{Y|x} = \sigma_Y^2(1 - \rho^2)$.

(a) $E[Y \mid X = 1350] = 117 + (0.58)\frac{7.2}{126}(1350 - 980) = 129.2629$
> EYgX1350 <- 117 + 0.58*(7.2/126)*(1350 - 980)
> EYgX1350
[1] 129.2629
(b) $E[Y \mid X = 700] = 117 + (0.58)\frac{7.2}{126}(700 - 980) = 107.72$
> EYgX700 <- 117 + 0.58*(7.2/126)*(700 - 980)
> EYgX700
[1] 107.72
(c) $Var[Y \mid X = 700] = 7.2^2(1 - 0.58^2) = 34.401$
> VYgX700 <- 7.2^2*(1 - 0.58^2)
> VYgX700
[1] 34.40102
(d) $P(Y \le 120 \mid X = 1350) = 0.0571$. The conditional variance does not depend on x, so VYgX700 is reused here.
> pnorm(120, EYgX1350, sqrt(VYgX700))
[1] 0.05713586
(e) $P(Y \ge 100 \mid X = 700) = 0.906$
> 1 - pnorm(100, EYgX700, sqrt(VYgX700))
[1] 0.9059515
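Exercises 22 through 24 all apply the same two formulas, so the conditional mean and standard deviation can be wrapped in a small helper. The function cond_norm() below is a hypothetical convenience written for this sketch; it is not part of the text or of any package.

```r
# cond_norm() is an illustrative helper (an assumption of this sketch).
# It returns the mean and standard deviation of Y | X = x for a
# bivariate normal distribution with the given parameters.
cond_norm <- function(x, muX, sigmaX, muY, sigmaY, rho) {
  list(mean = muY + rho * (sigmaY / sigmaX) * (x - muX),
       sd   = sigmaY * sqrt(1 - rho^2))
}

# Exercise 24(d): P(Y <= 120 | X = 1350)
sat <- cond_norm(1350, muX = 980, sigmaX = 126, muY = 117, sigmaY = 7.2, rho = 0.58)
p24d <- pnorm(120, sat$mean, sat$sd)
p24d
```

The same call with the parameters of Exercise 22 or 23 should reproduce the conditional means and variances computed there.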
25. A canning industry uses tins with weight equal to 20 grams. The tin is placed on a scale and filled with red peppers until the scale shows the weight µ. Then, the tin contains Y grams of peppers. If the scale is subject to a random error $X \sim N(0, \sigma = 10)$,
(a) How is Y related to X and µ?
(b) What is the probability distribution of the random variable Y?
(c) Calculate the value µ such that 98% of the tins contain at least 400 grams of peppers.
(d) Repeat the exercise assuming the weight of the tins to be a normal random variable $W \sim N(20, \sigma = 5)$ if X and W are independent.

Solution:
(a) $Y = \mu + X - 20$
(b) $E[Y] = E[\mu + X - 20] = E[\mu] + E[X] - E[20] = \mu + 0 - 20 = \mu - 20$ and $Var[Y] = Var[\mu + X - 20] = Var[X] = 100$. Since $X \sim N(0, 10)$, it follows that $Y \sim N(\mu - 20, 10)$.
(c)
\[
\begin{aligned}
P(Y \ge 400) &= 0.98\\
1 - P(Y \le 400) &= 0.98\\
P(Y \le 400) &= 0.02\\
P\left(Z \le \frac{400 - (\mu - 20)}{10}\right) &= 0.02\\
\Longrightarrow \frac{420 - \mu}{10} &= z_{0.02}\\
\mu &= 420 - z_{0.02} \times 10 = 440.5375
\end{aligned}
\]
> mu <- 420 - qnorm(0.02)*10
> mu
[1] 440.5375
µ must be at least 440.5375 grams for 98% of the tins to contain at least 400 grams of peppers.
(d) $Y = \mu + X - W$, where $X \sim N(0, 10)$ and $W \sim N(20, 5)$.
$E[Y] = \mu + 0 - 20 = \mu - 20$
$Var[Y] = Var[X] + Var[W] = 100 + 25 = 125$, because X and W are independent.
$Y \sim N(\mu - 20, \sqrt{125})$.
\[
\begin{aligned}
P(Y \ge 400) &= 0.98\\
1 - P(Y \le 400) &= 0.98\\
P(Y \le 400) &= 0.02\\
P\left(Z \le \frac{400 - (\mu - 20)}{\sqrt{125}}\right) &= 0.02\\
\Longrightarrow \frac{420 - \mu}{\sqrt{125}} &= z_{0.02}\\
\mu &= 420 - z_{0.02} \times \sqrt{125} = 442.9616
\end{aligned}
\]
> mu <- 420 - qnorm(0.02)*sqrt(125)
> mu
[1] 442.9616
µ must be at least 442.9616 grams for 98% of the tins to contain at least 400 grams of peppers.

26. Given the joint density function $f_{X,Y}(x, y) = x + y$, $0 \le x \le 1$, $0 \le y \le 1$,
(a) Show that $f_{X,Y}(x, y) \ge 0$ for all x and y and that $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1$.
(b) Find the cumulative distribution function. (c) Find the marginal means of X and Y . (d) Find the marginal variances of X and Y . Solution:
(a) By inspection, $x + y$ is greater than or equal to zero in the ranges given for x and y. For property 2,
\[
\begin{aligned}
\int_0^1\int_0^1 (x + y)\, dx\, dy &= \int_0^1 \left[\frac{x^2}{2} + xy\right]_0^1 dy\\
&= \int_0^1 \left(\frac{1}{2} + y\right) dy\\
&= \left[\frac{y}{2} + \frac{y^2}{2}\right]_0^1\\
&= \frac{1}{2} + \frac{1}{2} = 1
\end{aligned}
\]
(b) The cumulative distribution function must be defined in five regions: $x < 0$ or $y < 0$; $0 \le x < 1$ and $0 \le y < 1$; $x \ge 1$ and $0 \le y < 1$; $y \ge 1$ and $0 \le x < 1$; and $x, y \ge 1$.
For the region $0 \le x < 1$ and $0 \le y < 1$,
\[
F_{X,Y}(x, y) = \int_0^x\int_0^y (r + s)\, ds\, dr = \frac{x^2 y}{2} + \frac{x y^2}{2}.
\]
For the region $x \ge 1$ and $0 \le y < 1$,
\[
F_{X,Y}(x, y) = \int_0^y\int_0^1 (r + s)\, dr\, ds = \frac{y}{2} + \frac{y^2}{2}.
\]
For the region $y \ge 1$ and $0 \le x < 1$,
\[
F_{X,Y}(x, y) = \int_0^x\int_0^1 (r + s)\, ds\, dr = \frac{x}{2} + \frac{x^2}{2}.
\]
\[
F_{X,Y}(x, y) = \begin{cases}
0 & x < 0 \text{ or } y < 0\\
\frac{x^2 y}{2} + \frac{x y^2}{2} & 0 \le x < 1,\ 0 \le y < 1\\
\frac{y}{2} + \frac{y^2}{2} & x \ge 1,\ 0 \le y < 1\\
\frac{x}{2} + \frac{x^2}{2} & y \ge 1,\ 0 \le x < 1\\
1 & x, y \ge 1
\end{cases}
\]
(c) To find the marginal means and variances, the marginal densities of X and Y must be calculated.
\[
f_X(x) = \int_0^1 (x + y)\, dy = \left[xy + \frac{y^2}{2}\right]_0^1 = x + \frac{1}{2} \quad \text{for } 0 \le x \le 1
\]
Similarly, $f_Y(y) = y + \frac{1}{2}$ for $0 \le y \le 1$.
\[
E[X] = \int_0^1 x\left(x + \tfrac{1}{2}\right) dx = \left[\frac{x^3}{3} + \frac{x^2}{4}\right]_0^1 = \frac{7}{12}
\]
\[
E[Y] = \int_0^1 y\left(y + \tfrac{1}{2}\right) dy = \left[\frac{y^3}{3} + \frac{y^2}{4}\right]_0^1 = \frac{7}{12}
\]
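(d) To calculate the variances, $E[X^2]$ and $E[Y^2]$ must be found:
\[
E[X^2] = \int_0^1 x^2\left(x + \tfrac{1}{2}\right) dx = \left[\frac{x^4}{4} + \frac{x^3}{6}\right]_0^1 = \frac{5}{12}
\]
\[
E[Y^2] = \int_0^1 y^2\left(y + \tfrac{1}{2}\right) dy = \left[\frac{y^4}{4} + \frac{y^3}{6}\right]_0^1 = \frac{5}{12}
\]
\[
Var[X] = E[X^2] - (E[X])^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{60 - 49}{144} = \frac{11}{144}
\]
\[
Var[Y] = E[Y^2] - (E[Y])^2 = \frac{5}{12} - \left(\frac{7}{12}\right)^2 = \frac{60 - 49}{144} = \frac{11}{144}
\]

27. The lifetimes of two electronic components are two random variables, X and Y. Their joint density function is given by
\[
f_{X,Y}(x, y) = \frac{1 + x + y + cxy}{c + 3}\exp(-(x + y)), \quad x \ge 0 \text{ and } y \ge 0.
\]
(a) Verify that $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy = 1$.
(b) Find $f_X(x)$.
(c) What value of c makes X and Y independent?

Solution: Note that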
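\[
\int_0^{\infty} e^{-x}\, dx = -e^{-x}\Big|_0^{\infty} = 1 \quad \text{and} \quad \int_0^{\infty} xe^{-x}\, dx = -xe^{-x}\Big|_0^{\infty} + \int_0^{\infty} e^{-x}\, dx = 1.
\]
(a)
\[
\begin{aligned}
1 &\overset{?}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy\\
&= \int_0^{\infty}\int_0^{\infty} \frac{1 + x + y + cxy}{c + 3}\exp(-(x + y))\, dx\, dy\\
&= \frac{1}{c + 3}\left[\int_0^{\infty}\!\!\int_0^{\infty} e^{-(x+y)}\, dx\, dy + \int_0^{\infty}\!\!\int_0^{\infty} x e^{-(x+y)}\, dx\, dy + \int_0^{\infty}\!\!\int_0^{\infty} y e^{-(x+y)}\, dx\, dy + c\int_0^{\infty}\!\!\int_0^{\infty} xy e^{-(x+y)}\, dx\, dy\right]\\
&= \frac{1}{c + 3}\left[\int_0^{\infty} e^{-y}\!\int_0^{\infty} e^{-x}\, dx\, dy + \int_0^{\infty} e^{-y}\!\int_0^{\infty} xe^{-x}\, dx\, dy + \int_0^{\infty} e^{-x}\!\int_0^{\infty} ye^{-y}\, dy\, dx + c\int_0^{\infty} ye^{-y}\!\int_0^{\infty} xe^{-x}\, dx\, dy\right]\\
&= \frac{1}{c + 3}\left[1 + 1 + 1 + c\right]\\
1 &= 1
\end{aligned}
\]
(b)
\[
\begin{aligned}
f_X(x) &= \int_0^{\infty} f_{X,Y}(x, y)\, dy\\
&= \int_0^{\infty} \frac{1 + x + y + cxy}{c + 3}\exp(-(x + y))\, dy\\
&= \frac{1}{c + 3}\left[\int_0^{\infty} e^{-(x+y)}\, dy + \int_0^{\infty} x e^{-(x+y)}\, dy + \int_0^{\infty} y e^{-(x+y)}\, dy + \int_0^{\infty} cxy\, e^{-(x+y)}\, dy\right]\\
&= \frac{1}{c + 3}\left[e^{-x} + xe^{-x} + e^{-x} + cxe^{-x}\right]
\end{aligned}
\]
\[
f_X(x) = \frac{e^{-x}\left[2 + x + cx\right]}{c + 3} \quad \text{for } x \ge 0
\]
(c) If X and Y are to be independent, $f_X(x)\cdot f_Y(y) = f_{X,Y}(x, y)$.
\[
\begin{aligned}
f_X(x)\cdot f_Y(y) &\overset{\text{set}}{=} f_{X,Y}(x, y)\\
\frac{e^{-x}\left[2 + x + cx\right]}{c + 3}\cdot\frac{e^{-y}\left[2 + y + cy\right]}{c + 3} &= \frac{1 + x + y + cxy}{c + 3}\, e^{-x}e^{-y}\\
\left[2 + x + cx\right]\cdot\left[2 + y + cy\right] &= (1 + x + y + cxy)(c + 3)\\
4 + 2y + 2cy + 2x + xy + cxy + 2cx + cxy + c^2xy &= c + cx + cy + c^2xy + 3 + 3x + 3y + 3cxy\\
1 - y - x + xy &= c(1 - x - y + xy)\\
\therefore\ c &= 1 \quad \forall (x, y).
\end{aligned}
\]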
If $c = 1$, X and Y are independent.

28. Given the joint continuous pdf
\[
f_{X,Y}(x, y) = \begin{cases} 1 & \text{if } 0 \le x \le 1,\ 0 \le y \le 1 \text{ and}\\ 0 & \text{otherwise} \end{cases}
\]
and using the function adaptIntegrate() from the package cubature,
(a) find $F_{X,Y}(x = 0.6, y = 0.8)$.
(b) find $P(0.25 \le X \le 0.75,\ 0.1 \le Y \le 0.9)$.
(c) find $f_X(x)$.

Solution:
(a) $F_{X,Y}(x = 0.6, y = 0.8) = \int_0^{0.6}\int_0^{0.8} f_{X,Y}(r, s)\, ds\, dr = 0.48$
> library(cubature) # load
> f <- function(x) { 1 }
> adaptIntegrate(f, lowerLimit = c(0, 0), upperLimit = c(0.8, 0.6))
$integral
[1] 0.48
$error
[1] 0
$functionEvaluations
[1] 17
$returnCode
[1] 0
(b) $P(0.25 \le X \le 0.75,\ 0.1 \le Y \le 0.9) = \int_{0.25}^{0.75}\int_{0.1}^{0.9} f_{X,Y}(r, s)\, ds\, dr = 0.4$
> adaptIntegrate(f, lowerLimit = c(0.10, 0.25), upperLimit = c(0.9, 0.75))
$integral
[1] 0.4
$error
[1] 0
$functionEvaluations
[1] 17
$returnCode
[1] 0
(c) $f_X(x) = \int_0^1 f_{X,Y}(x, y)\, dy = 1$
> adaptIntegrate(f, lowerLimit = 0, upperLimit = 1)
$integral
[1] 1
$error
[1] 1.110223e-14
$functionEvaluations
[1] 15
$returnCode
[1] 0
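29. Let X and Y have the joint density function
\[
f_{X,Y}(x, y) = \begin{cases} Kxy & 2 \le x \le 4 \text{ and } 4 \le y \le 6 \text{ and}\\ 0 & \text{otherwise.} \end{cases}
\]
(a) Find K so that the given function is a valid pdf.
(b) Find the marginal densities of X and Y.
(c) Are X and Y independent? Justify.

Solution:
(a)
\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_4^6\int_2^4 Kxy\, dx\, dy\\
&= \int_4^6 \left[\frac{Kx^2y}{2}\right]_2^4 dy\\
&= \int_4^6 \left(\frac{16Ky}{2} - \frac{4Ky}{2}\right) dy\\
&= \int_4^6 6Ky\, dy\\
&= 3Ky^2\Big|_4^6\\
&= 3K(36 - 16)\\
1 &= 60K\\
\Longrightarrow K &= \frac{1}{60}
\end{aligned}
\]
> f <- function(x) { x[1]*x[2] }
> K <- 1/adaptIntegrate(f, lowerLimit = c(2, 4), upperLimit = c(4, 6))$integral
> MASS::fractions(K)
[1] 1/60
(b)
\[
f_X(x) = \int_4^6 \frac{1}{60}xy\, dy = \left[\frac{xy^2}{120}\right]_4^6 = \frac{x}{120}\left[36 - 16\right] = \frac{x}{6} \quad \text{for } 2 \le x \le 4
\]
\[
f_Y(y) = \int_2^4 \frac{1}{60}xy\, dx = \left[\frac{x^2y}{120}\right]_2^4 = \frac{y}{120}\left[16 - 4\right] = \frac{y}{10} \quad \text{for } 4 \le y \le 6
\]
(c) X and Y are independent since $f_X(x)\cdot f_Y(y) = \frac{x}{6}\cdot\frac{y}{10} = \frac{xy}{60} = f_{X,Y}(x, y)$.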
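30. Given the joint density function of X and Y
\[
f_{X,Y}(x, y) = \begin{cases} 1/2 & x + y \le 2,\ x \ge 0,\ y \ge 0 \text{ and}\\ 0 & \text{otherwise.} \end{cases}
\]
(a) Find the marginal densities of X and Y.
(b) Find $E[X]$, $E[Y]$, $\operatorname{Cov}[X, Y]$, and $\rho_{X,Y}$.
(c) Find $P\left(X + Y < 1 \mid X > \frac{1}{2}\right)$.

Solution:
(a) The marginal of X is
\[
f_X(x) = \int_0^{2-x} \frac{1}{2}\, dy = \frac{y}{2}\Big|_0^{2-x} = \frac{2 - x}{2} = 1 - \frac{x}{2} \quad \text{for } 0 \le x \le 2,
\]
and the marginal of Y is
\[
f_Y(y) = \int_0^{2-y} \frac{1}{2}\, dx = \frac{x}{2}\Big|_0^{2-y} = \frac{2 - y}{2} = 1 - \frac{y}{2} \quad \text{for } 0 \le y \le 2.
\]
(b)
\[
\begin{aligned}
E[X] &= \int_0^2 x f_X(x)\, dx = \int_0^2 \left(x - \frac{x^2}{2}\right) dx = \left[\frac{x^2}{2} - \frac{x^3}{6}\right]_0^2 = 2 - \frac{4}{3} = \frac{2}{3}\\
E[Y] &= \int_0^2 y f_Y(y)\, dy = \int_0^2 \left(y - \frac{y^2}{2}\right) dy = \left[\frac{y^2}{2} - \frac{y^3}{6}\right]_0^2 = 2 - \frac{4}{3} = \frac{2}{3}\\
E[XY] &= \int_0^2\int_0^{2-y} xy f_{X,Y}(x, y)\, dx\, dy = \int_0^2\int_0^{2-y} \frac{xy}{2}\, dx\, dy = \int_0^2 \frac{y}{2}\cdot\frac{x^2}{2}\Big|_0^{2-y} dy\\
&= \int_0^2 \frac{y}{4}(4 - 4y + y^2)\, dy = \left[\frac{y^2}{2} - \frac{y^3}{3} + \frac{y^4}{16}\right]_0^2 = 2 - \frac{8}{3} + 1 = \frac{1}{3}\\
\operatorname{Cov}[X, Y] &= E[XY] - E[X]\cdot E[Y] = \frac{1}{3} - \frac{2}{3}\cdot\frac{2}{3} = -\frac{1}{9}\\
E[X^2] &= \int_0^2 x^2 f_X(x)\, dx = \int_0^2 \left(x^2 - \frac{x^3}{2}\right) dx = \left[\frac{x^3}{3} - \frac{x^4}{8}\right]_0^2 = \frac{8}{3} - 2 = \frac{2}{3}\\
E[Y^2] &= \int_0^2 y^2 f_Y(y)\, dy = \int_0^2 \left(y^2 - \frac{y^3}{2}\right) dy = \left[\frac{y^3}{3} - \frac{y^4}{8}\right]_0^2 = \frac{8}{3} - 2 = \frac{2}{3}\\
Var[X] &= E[X^2] - (E[X])^2 = \frac{2}{3} - \left(\frac{2}{3}\right)^2 = \frac{2}{9}\\
Var[Y] &= E[Y^2] - (E[Y])^2 = \frac{2}{3} - \left(\frac{2}{3}\right)^2 = \frac{2}{9}\\
\rho_{X,Y} &= \frac{\operatorname{Cov}[X, Y]}{\sigma_X\cdot\sigma_Y} = \frac{-1/9}{\sqrt{2/9}\cdot\sqrt{2/9}} = -\frac{1}{2}
\end{aligned}
\]
(c)
\[
\begin{aligned}
P\left(X + Y < 1 \,\Big|\, X > \tfrac{1}{2}\right) &= \frac{P\left(X + Y < 1,\ X > \tfrac{1}{2}\right)}{P\left(X > \tfrac{1}{2}\right)}
= \frac{\int_{1/2}^{1}\int_0^{1-x} \frac{1}{2}\, dy\, dx}{\int_{1/2}^{2} \left(1 - \frac{x}{2}\right) dx}\\
&= \frac{\int_{1/2}^{1} \frac{1-x}{2}\, dx}{\left[x - \frac{x^2}{4}\right]_{1/2}^{2}}
= \frac{\left[\frac{x}{2} - \frac{x^2}{4}\right]_{1/2}^{1}}{\left[x - \frac{x^2}{4}\right]_{1/2}^{2}}\\
&= \frac{\left(\frac{1}{2} - \frac{1}{4}\right) - \left(\frac{1}{4} - \frac{1}{16}\right)}{(2 - 1) - \left(\frac{1}{2} - \frac{1}{16}\right)}
= \frac{\frac{1}{4} - \frac{3}{16}}{1 - \frac{7}{16}} = \frac{1}{9}
\end{aligned}
\]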
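The moments in (b) can be verified by numerical integration over the triangular support. The sketch below is illustrative only: expect_tri() is a hypothetical helper written for this check, not a function from the text.

```r
# expect_tri() approximates E[g(X, Y)] for the density 1/2 on the triangle
# x >= 0, y >= 0, x + y <= 2 (illustrative helper, an assumption of this sketch).
# g must return a vector as long as y, hence the "+ 0 * y" terms below.
expect_tri <- function(g) {
  inner <- function(x) {
    sapply(x, function(xi) {
      integrate(function(y) g(xi, y) / 2, lower = 0, upper = 2 - xi)$value
    })
  }
  integrate(inner, lower = 0, upper = 2)$value
}
EX  <- expect_tri(function(x, y) x + 0 * y)
EY  <- expect_tri(function(x, y) y)
EXY <- expect_tri(function(x, y) x * y)
VX  <- expect_tri(function(x, y) x^2 + 0 * y) - EX^2
VY  <- expect_tri(function(x, y) y^2) - EY^2
rho <- (EXY - EX * EY) / sqrt(VX * VY)
c(EX = EX, EY = EY, Cov = EXY - EX * EY, rho = rho)
```

The results should match 2/3, 2/3, -1/9, and -1/2 up to numerical integration error.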
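31. Let X and Y have the joint density function
\[
f_{X,Y}(x, y) = \begin{cases} Ky & -2 \le x \le 2,\ 1 \le y \le x^2 \text{ and}\\ 0 & \text{otherwise.} \end{cases}
\]
(a) Find K so that $f_{X,Y}(x, y)$ is a valid pdf.
(b) Find the marginal densities of X and Y.
(c) Find $P\left(Y > \frac{3}{2} \mid X < \frac{1}{2}\right)$.

Solution:
(a) Note that $-2 \le x \le 2$ and $1 \le y \le x^2$ implies that $1 \le x^2$, which means that $(x \le -1) \cup (x \ge 1)$ is implied in the ranges of the variables.
\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy\\
&= \int_{-2}^{-1}\int_1^{x^2} Ky\, dy\, dx + \int_1^2\int_1^{x^2} Ky\, dy\, dx\\
&= K\left[\int_{-2}^{-1} \frac{x^4 - 1}{2}\, dx + \int_1^2 \frac{x^4 - 1}{2}\, dx\right]\\
&= K\left\{\left[\frac{x^5}{10} - \frac{x}{2}\right]_{-2}^{-1} + \left[\frac{x^5}{10} - \frac{x}{2}\right]_1^2\right\}\\
&= K\left\{\left(-\frac{1}{10} + \frac{1}{2}\right) - \left(-\frac{32}{10} + 1\right) + \left(\frac{32}{10} - 1\right) - \left(\frac{1}{10} - \frac{1}{2}\right)\right\}\\
&= K\left\{\frac{4}{10} + \frac{22}{10} + \frac{22}{10} + \frac{4}{10}\right\}\\
1 &= K\,\frac{26}{5}\\
\Longrightarrow K &= \frac{5}{26}
\end{aligned}
\]
(b)
\[
f_X(x) = \int_y f_{X,Y}(x, y)\, dy = \int_1^{x^2} \frac{5}{26}y\, dy = \frac{5}{52}y^2\Big|_1^{x^2} = \frac{5}{52}(x^4 - 1)
\]
\[
\Longrightarrow f_X(x) = \begin{cases} \frac{5}{52}(x^4 - 1), & (-2 \le x \le -1) \cup (1 \le x \le 2)\\ 0, & \text{otherwise} \end{cases}
\]
\[
\begin{aligned}
f_Y(y) &= \int_x f_{X,Y}(x, y)\, dx = \int_{-2}^{-\sqrt{y}} \frac{5}{26}y\, dx + \int_{\sqrt{y}}^{2} \frac{5}{26}y\, dx\\
&= \frac{5}{26}xy\Big|_{-2}^{-\sqrt{y}} + \frac{5}{26}xy\Big|_{\sqrt{y}}^{2}\\
&= \frac{5}{26}\left[-y^{3/2} + 2y + 2y - y^{3/2}\right]\\
&= \frac{5}{26}\left[4y - 2y^{3/2}\right]
\end{aligned}
\]
\[
\Longrightarrow f_Y(y) = \begin{cases} \frac{5}{13}(2y - y^{3/2}), & 1 \le y \le 4\\ 0, & \text{otherwise} \end{cases}
\]
(c)
\[
\begin{aligned}
P\left(Y > \tfrac{3}{2} \,\Big|\, X < \tfrac{1}{2}\right) &= \frac{P\left(Y > \tfrac{3}{2},\ X < \tfrac{1}{2}\right)}{P\left(X < \tfrac{1}{2}\right)}
= \frac{\int_{-2}^{-\sqrt{3/2}}\int_{3/2}^{x^2} \frac{5}{26}y\, dy\, dx}{\int_{-2}^{-1} \frac{5}{52}(x^4 - 1)\, dx}\\
&= \frac{\int_{-2}^{-\sqrt{3/2}} \frac{y^2}{2}\Big|_{3/2}^{x^2} dx}{\frac{1}{2}\left[\frac{x^5}{5} - x\right]_{-2}^{-1}}
= \frac{\int_{-2}^{-\sqrt{3/2}} \left(x^4 - \frac{9}{4}\right) dx}{\left(-\frac{1}{5} + 1\right) - \left(-\frac{32}{5} + 2\right)}\\
&= \frac{\left[\frac{x^5}{5} - \frac{9}{4}x\right]_{-2}^{-\sqrt{3/2}}}{\frac{4}{5} + \frac{22}{5}}
= \frac{-\frac{9}{4}\cdot\frac{\sqrt{3/2}}{5} + \frac{9}{4}\sqrt{\frac{3}{2}} - \left(-\frac{32}{5} + \frac{9}{2}\right)}{\frac{26}{5}}\\
&= \frac{\frac{36}{20}\sqrt{\frac{3}{2}} + \frac{19}{10}}{\frac{26}{5}}
= \frac{9\sqrt{\frac{3}{2}} + \frac{19}{2}}{26}
= \frac{9\sqrt{6} + 19}{52} = 0.7893
\end{aligned}
\]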
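Both the constant K and the conditional probability in (c) can be checked numerically. This is a sketch only; the names fX, inner, total, num, and den are illustrative and do not come from the original solution.

```r
K <- 5 / 26
# Marginal density of X on [-2, -1] and [1, 2], from part (b).
fX <- function(x) K / 2 * (x^4 - 1)
# The density is symmetric in x, so the total mass is twice the mass on [1, 2].
total <- 2 * integrate(fX, lower = 1, upper = 2)$value
# Numerator of (c): integrate K * y over 3/2 < y < x^2 for -2 < x < -sqrt(3/2).
inner <- function(x) {
  sapply(x, function(xi) {
    integrate(function(y) K * y, lower = 3/2, upper = xi^2)$value
  })
}
num <- integrate(inner, lower = -2, upper = -sqrt(3/2))$value
# Denominator of (c): P(X < 1/2) = P(-2 <= X <= -1).
den <- integrate(fX, lower = -2, upper = -1)$value
c(total = total, prob = num / den, closed_form = (9 * sqrt(6) + 19) / 52)
```

The first entry should be 1, and the last two entries should agree.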
32. An engineer has designed a new diesel motor that is used in a prototype earth mover. The prototype's diesel consumption in gallons per mile C follows the equation $C = 3 + 2X + \frac{3}{2}Y$, where X is a speed coefficient and Y is the quality diesel coefficient. Suppose the joint density for X and Y is $f_{X,Y}(x, y) = ky$, $0 \le x \le 2$, $0 \le y \le x$.
(a) Find k so that $f_{X,Y}(x, y)$ is a valid density function.
(b) Are X and Y independent?
(c) Find the mean diesel consumption for the prototype.

Solution:
(a)
\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy\\
&= \int_0^2\int_0^x ky\, dy\, dx\\
&= \int_0^2 \frac{ky^2}{2}\Big|_0^x dx\\
&= \int_0^2 \frac{kx^2}{2}\, dx\\
&= \frac{kx^3}{6}\Big|_0^2\\
1 &= \frac{8k}{6}\\
\Longrightarrow k &= \frac{3}{4}
\end{aligned}
\]
(b) For X and Y to be independent, $f_X(x)\cdot f_Y(y) = f_{X,Y}(x, y)$.
\[
f_X(x) = \int_0^x \frac{3}{4}y\, dy = \frac{3}{8}y^2\Big|_0^x = \frac{3}{8}x^2 \quad \text{for } 0 \le x \le 2
\]
\[
f_Y(y) = \int_y^2 \frac{3}{4}y\, dx = \frac{3}{4}xy\Big|_y^2 = \frac{3}{4}(2y - y^2) \quad \text{for } 0 \le y \le 2
\]
$f_X(x)\cdot f_Y(y) = \frac{9}{32}x^2(2y - y^2) \ne \frac{3}{4}y = f_{X,Y}(x, y)$, so X and Y are not independent.
(c)
\[
\begin{aligned}
E[C] &= E\left[3 + 2X + \tfrac{3}{2}Y\right] = 3 + 2E[X] + \tfrac{3}{2}E[Y]\\
E[X] &= \int_0^2 x\cdot f_X(x)\, dx = \int_0^2 x\cdot\frac{3}{8}x^2\, dx = \frac{3x^4}{32}\Big|_0^2 = \frac{3}{2}\\
E[Y] &= \int_0^2 y\cdot\frac{3}{4}(2y - y^2)\, dy = \frac{3}{4}\left[\frac{2y^3}{3} - \frac{y^4}{4}\right]_0^2 = \frac{3}{4}\left(\frac{16}{3} - 4\right) = 1\\
E[C] &= 3 + 2E[X] + \tfrac{3}{2}E[Y] = 3 + 2\cdot\frac{3}{2} + \frac{3}{2}\cdot 1 = 7.5
\end{aligned}
\]
The average diesel consumption is 7.5 gallons/mile.

33. To make porcelain, kaolin X and feldspar Y are needed to create a soft mixture that later becomes hard. The proportion of these components for every ton of porcelain has the density function $f_{X,Y}(x, y) = Kx^2y$, $0 \le x \le y \le 1$, $x + y \le 1$.
(a) Find the value of K so that $f_{X,Y}(x, y)$ is a valid pdf.
(b) Find the marginal densities of X and Y.
(c) Find the kaolin mean and the feldspar mean per ton.
(d) Find the probability that the proportion of feldspar will be higher than $\frac{1}{3}$, if the kaolin is more than half of the porcelain.

Solution:
(a)
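\[
\begin{aligned}
1 &\overset{\text{set}}{=} \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\, dx\, dy\\
&= \int_0^1\int_0^{1-y} kx^2y\, dx\, dy\\
&= \int_0^1 ky\cdot\frac{x^3}{3}\Big|_0^{1-y} dy\\
&= \frac{k}{3}\int_0^1 \left(y - 3y^2 + 3y^3 - y^4\right) dy\\
&= \frac{k}{3}\left[\frac{y^2}{2} - y^3 + \frac{3}{4}y^4 - \frac{y^5}{5}\right]_0^1\\
&= \frac{k}{3}\left[\frac{1}{2} - 1 + \frac{3}{4} - \frac{1}{5}\right]\\
1 &= \frac{k}{60}\\
\Longrightarrow k &= 60
\end{aligned}
\]
(b)
\[
f_X(x) = \int_0^{1-x} 60x^2y\, dy = 30x^2y^2\Big|_0^{1-x} = 30x^2(1 - x)^2 \quad \text{for } 0 \le x \le 1
\]
\[
f_Y(y) = \int_0^{1-y} 60x^2y\, dx = 20x^3y\Big|_0^{1-y} = 20y(1 - y)^3 \quad \text{for } 0 \le y \le 1
\]
(c)
\[
\begin{aligned}
E[X] &= \int_0^1 x\cdot 30x^2(1 - x)^2\, dx = \int_0^1 \left(30x^3 - 60x^4 + 30x^5\right) dx\\
&= \left[\frac{15x^4}{2} - 12x^5 + 5x^6\right]_0^1 = \frac{15}{2} - 12 + 5 = \frac{1}{2}\\
E[Y] &= \int_0^1 y\cdot 20y(1 - y)^3\, dy = \int_0^1 20y^2(1 - 3y + 3y^2 - y^3)\, dy\\
&= \left[\frac{20y^3}{3} - 15y^4 + 12y^5 - \frac{10y^6}{3}\right]_0^1 = \frac{20}{3} - 15 + 12 - \frac{10}{3} = \frac{1}{3}
\end{aligned}
\]
(d)
\[
\begin{aligned}
P\left(Y > \tfrac{1}{3} \,\Big|\, X > \tfrac{1}{2}\right) &= \frac{P\left(Y > \tfrac{1}{3},\ X > \tfrac{1}{2}\right)}{P\left(X > \tfrac{1}{2}\right)}
= \frac{\int_{1/3}^{1/2}\int_{1/2}^{1-y} 60x^2y\, dx\, dy}{\int_{1/2}^{1} \left(30x^2 - 60x^3 + 30x^4\right) dx}\\
&= \frac{\int_{1/3}^{1/2}\int_{1/2}^{1-y} 2x^2y\, dx\, dy}{\int_{1/2}^{1} \left(x^2 - 2x^3 + x^4\right) dx}
= \frac{\int_{1/3}^{1/2} 2y\cdot\frac{x^3}{3}\Big|_{1/2}^{1-y} dy}{\left[\frac{x^3}{3} - \frac{x^4}{2} + \frac{x^5}{5}\right]_{1/2}^{1}}\\
&= \frac{\int_{1/3}^{1/2} \frac{2y}{3}\left(1 - 3y + 3y^2 - y^3 - \frac{1}{8}\right) dy}{\left(\frac{1}{3} - \frac{1}{2} + \frac{1}{5}\right) - \left(\frac{1}{24} - \frac{1}{32} + \frac{1}{160}\right)}
= \frac{\frac{1}{3}\int_{1/3}^{1/2} \left(\frac{7}{4}y - 6y^2 + 6y^3 - 2y^4\right) dy}{\frac{1}{30} - \frac{1}{60}}\\
&= 20\left[\frac{7y^2}{8} - 2y^3 + \frac{3y^4}{2} - \frac{2y^5}{5}\right]_{1/3}^{1/2}\\
&= 20\left[\left(\frac{7}{32} - \frac{1}{4} + \frac{3}{32} - \frac{1}{80}\right) - \left(\frac{7}{72} - \frac{2}{27} + \frac{1}{54} - \frac{2}{1215}\right)\right]\\
&= 20\left[\frac{1}{20} - \frac{389}{9720}\right] = \frac{97}{486} = 0.1996
\end{aligned}
\]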
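The value in (d) can be checked by evaluating the same integrals numerically. The sketch below follows the limits of integration used in the solution; the names inner, num, and den are illustrative.

```r
# Numerator of (d): integrate 60 * x^2 * y over 1/2 < x < 1 - y, 1/3 < y < 1/2.
inner <- function(y) {
  sapply(y, function(yi) {
    integrate(function(x) 60 * x^2 * yi, lower = 1/2, upper = 1 - yi)$value
  })
}
num <- integrate(inner, lower = 1/3, upper = 1/2)$value
# Denominator of (d): P(X > 1/2) from the marginal density 30 x^2 (1 - x)^2.
den <- integrate(function(x) 30 * x^2 * (1 - x)^2, lower = 1/2, upper = 1)$value
c(prob = num / den, closed_form = 97 / 486)
```

Both entries should agree to several decimal places.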
34. A device can fail in four different ways with probabilities $\pi_1 = 0.2$, $\pi_2 = 0.1$, $\pi_3 = 0.4$, and $\pi_4 = 0.3$. Suppose there are 12 devices that fail independently of one another. What is the probability of 3 failures of the first kind, 4 of the second, 3 of the third, and 2 of the fourth?

Solution:
\[
P(x_1 = 3, x_2 = 4, x_3 = 3, x_4 = 2) = \frac{12!}{3!\cdot 4!\cdot 3!\cdot 2!}(0.2)^3(0.1)^4(0.4)^3(0.3)^2 = 0.0013
\]
> dmultinom(x=c(3, 4, 3, 2), size=12, prob=c(0.2, 0.1, 0.4, 0.3))
[1] 0.001277338
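As a cross-check, the multinomial formula can be evaluated directly and compared with dmultinom(). The variable names x, p, and by_hand are illustrative.

```r
# Direct evaluation of the multinomial probability from its formula.
x <- c(3, 4, 3, 2)            # observed failure counts of each kind
p <- c(0.2, 0.1, 0.4, 0.3)    # failure probabilities of each kind
by_hand <- factorial(sum(x)) / prod(factorial(x)) * prod(p^x)
all.equal(by_hand, dmultinom(x, prob = p))   # compares formula and built-in
```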
35. The wait time in minutes a shopper spends in a local supermarket's checkout line has distribution
\[
f(x) = \frac{\exp(-x/2)}{2}, \quad x > 0.
\]
On weekends, however, the wait is longer, and the distribution then is given by
\[
g(x) = \frac{\exp(-x/3)}{3}, \quad x > 0.
\]
Find
(a) The probability that the waiting time for a customer will be less than 1 minute.
(b) The probability that, given a waiting time of less than 2 minutes, it will be a weekend.
(c) The probability that the customer waits less than 2 minutes.

Solution:
(a) Assume that the day a person shops is uniformly distributed across the week. Let X = wait time in minutes that a shopper spends in a local supermarket's checkout line and Y = weekend indicator, where 0 is a weekday and 1 is a weekend.
\[
\begin{aligned}
P(X < 1) &= P(X < 1 \cap Y = 0) + P(X < 1 \cap Y = 1)\\
&= P(X < 1 \mid Y = 0)P(Y = 0) + P(X < 1 \mid Y = 1)P(Y = 1)\\
&= \frac{5}{7}\int_0^1 \frac{e^{-x/2}}{2}\, dx + \frac{2}{7}\int_0^1 \frac{e^{-x/3}}{3}\, dx\\
&= \frac{5}{7}\left(1 - e^{-1/2}\right) + \frac{2}{7}\left(1 - e^{-1/3}\right) = 0.362
\end{aligned}
\]
> (1 - exp(-1/2))*5/7 + (1 - exp(-1/3))*2/7
[1] 0.3620406
> # or
> pexp(1, 1/2)*5/7 + pexp(1, 1/3)*2/7
[1] 0.3620406
(b) $P(Y = 1 \mid X < 2) = 0.2354$
> ((1 - exp(-2/3))*2/7) / ((1 - exp(-1))*5/7 + (1 - exp(-2/3))*2/7)
[1] 0.2354185
(c) $P(X < 2) = 0.5905$. Same as (a) with integrals from 0 to 2 rather than 0 to 1.
> (1 - exp(-1))*5/7 + (1 - exp(-2/3))*2/7
[1] 0.5905384
> # or
> pexp(2, 1/2)*5/7 + pexp(2, 1/3)*2/7
[1] 0.5905384
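Parts (a) through (c) all follow from the law of total probability and Bayes' rule, so they can be expressed as two small functions. The helpers p_wait() and p_weekend_given() below are hypothetical names introduced for this sketch, and the default weekend probability 2/7 is the same assumption made in the solution.

```r
# P(X < t): mixture of the weekday (rate 1/2) and weekend (rate 1/3) waits.
# p_wait() and p_weekend_given() are illustrative helpers, not from the text.
p_wait <- function(t, p_weekend = 2/7) {
  (1 - p_weekend) * pexp(t, rate = 1/2) + p_weekend * pexp(t, rate = 1/3)
}
# P(Y = 1 | X < t) by Bayes' rule.
p_weekend_given <- function(t, p_weekend = 2/7) {
  p_weekend * pexp(t, rate = 1/3) / p_wait(t, p_weekend)
}
c(a = p_wait(1), b = p_weekend_given(2), c = p_wait(2))
```

The three values should reproduce 0.3620406, 0.2354185, and 0.5905384 from the console output above.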
36. An engineering team has designed a lamp with two light bulbs. Let X be the lifetime for bulb 1 and Y the lifetime for bulb 2, both in thousands of hours. Suppose that X and Y are independent and they follow an Exp(λ = 1) distribution.
(a) Find the joint density function of X and Y. What is the probability neither bulb lasts longer than 1000 hours?
(b) If the lamp works when at least one bulb is lit, what is the probability that the lamp works no more than 2000 hours?
(c) What is the probability that the lamp works between 1000 and 2000 hours?

Solution: Note that the distribution function of an exponential with λ = 1 is $F_X(x) = 1 - e^{-x}$.
(a) $f_{X,Y}(x, y) = f_X(x)\cdot f_Y(y) = e^{-x}\cdot e^{-y} = e^{-(x+y)}$ for $x \ge 0$, $y \ge 0$ because X and Y are both distributed as Exp(λ = 1) and are independent.
\[
P(X < 1, Y < 1) = F_X(1)\cdot F_Y(1) = \left(1 - e^{-1}\right)^2 = 0.3996
\]
> (1 - exp(-1))^2
[1] 0.3995764
> # or
> pexp(1, 1)*pexp(1, 1)
[1] 0.3995764
(b) For the lamp to stop working within 2000 hours, both bulbs must die within 2000 hours.
\[
P\left((X < 2) \cap (Y < 2)\right) = P(X < 2)\cdot P(Y < 2) = \left(1 - e^{-2}\right)^2 = 0.7476
\]
> (1 - exp(-2)) * (1 - exp(-2))
[1] 0.7476451
> # or
> pexp(2, 1) * pexp(2, 1)
[1] 0.7476451
(c) The probability that the lamp works between 1000 and 2000 hours is
\[
\left(1 - e^{-2}\right)^2 - \left(1 - e^{-1}\right)^2 = 0.3481.
\]
> (1 - exp(-2))^2 - (1 - exp(-1))^2
[1] 0.3480687
> # or
> pexp(2, 1)^2 - pexp(1, 1)^2
[1] 0.3480687
37. The national weather service has issued a severe weather advisory for a particular county that indicates that severe thunderstorms will occur between 9 p.m. and 10 p.m. When the rain starts, the county places a call to the maintenance supervisor who opens the sluice gate to avoid flooding. Assuming the rain's start time is uniformly distributed between 9 p.m. and 10 p.m.,
(a) at what time, on the average, will the county maintenance supervisor open the sluice gate?
(b) What is the probability that the sluice gate will be opened before 9:30 p.m.?
Note: Solve this problem both by hand and using R.

Solution:
(a) Let X = time maintenance supervisor opens the sluice gate. Then $X \sim Unif(9, 10)$ and $f_X(x) = \frac{1}{10 - 9}$ for $9 \le x \le 10$.
\[
E[X] = \int_9^{10} x\cdot f_X(x)\, dx = \int_9^{10} \frac{x}{10 - 9}\, dx = \frac{x^2}{2}\Big|_9^{10} = \frac{100 - 81}{2} = 9.5
\]
> fx <- function(x) { x/(10 - 9) }
> integrate(fx, lower = 9, upper = 10)$value
[1] 9.5
On the average, the sluice gate opens at 9:30 p.m.
(b) $P(X < 9{:}30) = P(X < 9.5) = \int_9^{9.5} 1\, dx = x\Big|_9^{9.5} = 0.5$
> fx <- function(x) { 1/(10 - 9) }
> integrate(Vectorize(fx), lower=9, upper=9.5)$value
[1] 0.5
The probability that the sluice gate opens before 9:30 p.m. is 0.5.

38. Assume the distribution of grades for a particular group of students has a bivariate normal distribution with parameters $\mu_X = 3.2$, $\mu_Y = 2.4$, $\sigma_X = 0.4$, $\sigma_Y = 0.6$, and $\rho = 0.6$, where X and Y represent the grade point averages in high school and the first year of college, respectively.
(a) Set the seed equal to 194 (set.seed(194)), and use the function mvrnorm() from the MASS package to simulate the population, assuming the population of interest consists of 200 students. (Hint: Use empirical = TRUE.)
(b) Compute the means of X and Y. Are they equal to 3.2 and 2.4, respectively?
(c) Compute the variance of X and Y as well as the covariance between X and Y. Are the values 0.16, 0.36, and 0.144, respectively?
(d) Create a scatterplot of Y versus X. If a different seed value is used, how do the simulated numbers differ?

Solution:
(a)
> library(MASS)
> CovXY <- 0.6*0.4*0.6
> Sigma <- matrix(c(0.4^2, CovXY, CovXY, 0.6^2), nrow = 2)
> set.seed(194)
> values <- mvrnorm(n = 200, mu = c(3.2, 2.4), Sigma = Sigma,
+ empirical = TRUE)
(b) With empirical = TRUE, the means of X and Y are 3.2 and 2.4, respectively.
> colMeans(values)
(c) The variances of X and Y are 0.16 and 0.36, and the covariance between X and Y is 0.144.
> cov(values)
      [,1]  [,2]
[1,] 0.160 0.144
[2,] 0.144 0.360
(d) When a different seed value is used, different values are created. However, with the argument empirical = TRUE, the new values will still have means of 3.2 and 2.4 and standard deviations of 0.4 and 0.6 for X and Y, respectively. Further, the covariance between X and Y will remain 0.144.
> plot(values[,1], values[,2], xlab = "X", ylab = "Y", col = "blue",
+ pch = 19, cex = 0.5)
[Figure: scatterplot of Y versus X]
39. Show that if $X_1, X_2, \ldots, X_n$ are independent random variables with means $\mu_1, \mu_2, \ldots, \mu_n$ and variances $\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2$, respectively, then the mean and variance of $Y = \sum_{i=1}^{n} c_i X_i$, where the $c_i$s are real-valued constants, are $\mu_Y = \sum_{i=1}^{n} c_i\mu_i$ and $\sigma_Y^2 = \sum_{i=1}^{n} c_i^2\sigma_i^2$. (Hint: Use moment generating functions.)

Solution: Solving directly:
\[
E[Y] = E\left[\sum_{i=1}^{n} c_i X_i\right] = \sum_{i=1}^{n} c_i E[X_i] = \sum_{i=1}^{n} c_i\mu_i
\]
\[
\begin{aligned}
Var[Y] &= E\left[(Y - E[Y])^2\right]\\
&= E\left[\left(\sum_{i=1}^{n} c_i X_i - E\left[\sum_{i=1}^{n} c_i X_i\right]\right)^2\right]\\
&= E\left[\left(\sum_{i=1}^{n} c_i (X_i - \mu_i)\right)^2\right]\\
&= \sum_{i=1}^{n} c_i^2 E\left[(X_i - \mu_i)^2\right] + 2\sum_{i < j} c_i c_j E\left[(X_i - \mu_i)(X_j - \mu_j)\right]
\end{aligned}
\]
Because the $X_i$ are independent, each cross term $E[(X_i - \mu_i)(X_j - \mu_j)]$ is zero for $i \ne j$, so $Var[Y] = \sum_{i=1}^{n} c_i^2\sigma_i^2$.

> pt(3, 5)
[1] 0.9849504
(b) $P(2 < X < 3) = 0.0359$
> pt(3, 5) - pt(2, 5)
[1] 0.03592012
(c) If $P(X < a) = 0.05$, $a = -2.015$.
> qt(0.05, 5)
[1] -2.015048

3. If $(1 - 2t)^{-5}$, $t < \frac{1}{2}$, is the mgf of a random variable X, find $P(X < 15.99)$.

Solution: $M_X(t) = (1 - 2t)^{-5} \Longrightarrow X \sim \chi^2_{10}$, so $P(X < 15.99) = 0.9001$
> pchisq(15.99, 10)
[1] 0.900081

4. If $X \sim \chi^2_{10}$, find the constants a and b so that $P(a < X < b) = 0.90$ and $P(X < a) = 0.05$.

Solution: $X \sim \chi^2_{10}$, $P(a < X < b) = 0.90$, and $P(X < a) = 0.05 \Longrightarrow a = 3.9403$ and $b = 18.307$.
> a <- qchisq(0.05, 10)
> b <- qchisq(0.95, 10)
> c(a, b)
[1]  3.940299 18.307038

5. Let X be a $\chi^2_{10}$. Calculate $P(X < 8)$ and $P(X > 6)$. Calculate a so that $P(X < a) = 0.15$. What are the population mean and population variance of X?

Solution: $P(X < 8) = 0.3712$ and $P(X > 6) = 0.8153$. If $P(X < a) = 0.15$, then $a = 5.5701$. The population mean and population variance of a $\chi^2_{10}$ random variable are 10 and 20, respectively.
> pchisq(8, 10)
[1] 0.3711631
> pchisq(6, 10, lower = FALSE)
[1] 0.8152632
> qchisq(0.15, 10)
[1] 5.570059

6. Let X be distributed as an $F_{2,5}$. Calculate $P(X < 1)$ and the median of X. Calculate a so that $P(X < a) = 0.10$. What are the population mean and population variance of X?

Solution: $P(X < 1) = 0.5688$; the median of X is 0.7988; and $a = 0.1076$ if $P(X < a) = 0.10$. The population mean and population variance of an $F_{2,5}$ random variable are 1.6667 and 13.8889, respectively.
> pf(1, 2, 5)
[1] 0.5687988
> qf(0.50, 2, 5)
Chapter 6:
Sampling and Sampling Distributions
211
[1] 0.7987698
> qf(0.10, 2, 5)
[1] 0.1076122
> EX <- 5/(5 - 2)
> VX <- 2*5^2*(2 + 5 - 2)/(2*(5 - 2)^2*(5 - 4))
> c(EX, VX)
[1]  1.666667 13.888889
7. Assume a population with 5 elements: $X_1 = 0$, $X_2 = 1$, $X_3 = 2$, $X_4 = 3$, $X_5 = 4$.
(a) Calculate $\mu$ and $\sigma^2$.
(b) Calculate the sampling distribution of the mean for random samples of size 3 taken without replacement. Verify that the mean of $\bar{X}$ is 2 and that the variance of $\bar{X}$ is $\frac{\sigma^2}{6}$.
(c) Calculate the sampling distribution of $\bar{X}$ for random samples of size 3 taken with replacement. Verify that the mean of $\bar{X}$ is 2 and that the variance of $\bar{X}$ is $\frac{\sigma^2}{n}$.

Solution:
(a) The population mean and population variance are 2 and 2, respectively.
pop > >
SRS >
RS Sd Sd [1] 4.147288 (e) The mean of the sample mean when sampling without replacement and with replacement are both 8.
214 > > > > >
xbarSRS > > >
S2srs 2 and
(b) $P\left(\left|\frac{\bar{X}}{S_u}\right| \le 4\right)$.

Solution: Note that $\dfrac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$.
(a) $P\left(\frac{\bar{X}}{S} > 2\right) = P\left(t_5 > 2\sqrt{6}\right) = 0.0022$
> 1 - pt(sqrt(6)*2, 5)
[1] 0.002239216
(b)
\[
\frac{\bar{X} - 0}{\dfrac{S_u\sqrt{\frac{n}{n-1}}}{\sqrt{n}}} = \frac{\bar{X}}{\dfrac{S_u}{\sqrt{n-1}}} = \frac{\bar{X}}{S_u}\sqrt{n-1} \sim t_{n-1}.
\]
\[
P\left(\left|\frac{\bar{X}}{S_u}\right| \le 4\right) = P\left(-4\sqrt{5} \le \frac{\bar{X}}{S_u}\sqrt{5} \le 4\sqrt{5}\right) = P\left(-4\sqrt{5} \le t_5 \le 4\sqrt{5}\right) = 0.9997.
\]
> pt(4*sqrt(5), 5) - pt(-4*sqrt(5), 5)
[1] 0.9997089
12. Constant velocity joints (CV joints) allow a rotating shaft to transmit power through a variable angle, at constant rotational speed, without an appreciable increase in friction or play. An after-market company produces CV joints. To optimize energy transfer, the drive shaft must be very precise. The company has two different branches that produce CV joints where the variability of the drive shaft is known to be 2 mm. A sample of $n_1 = 10$ is drawn from the first branch, and a sample of $n_2 = 15$ is drawn from the second branch. Suppose that the diameter follows a normal distribution. What is the probability that the drive shafts coming from the first branch will have greater variability than those of the second branch?

Solution:
\[
P\left(\frac{S_1^2}{S_2^2} > 1\right) = P\left(F_{9,14} > 1\right) = 0.4823
\]
> 1 - pf(1, 9, 14)
[1] 0.4823316
13. Given a population N(µ, σ) with unknown mean and variance, a sample of size 11 is drawn and the sample variance $S^2$ is calculated. Calculate the probability $P(0.5 < S^2/\sigma^2 < 1.2)$.
Solution:
$$\begin{aligned}
P\left(0.5 < \frac{S^2}{\sigma^2} < 1.2\right) &= P\left(10(0.5) \le 10\frac{S^2}{\sigma^2} \le 10(1.2)\right) \\
&= P(5 \le \chi^2_{10} \le 12) \\
&= P(\chi^2_{10} \le 12) - P(\chi^2_{10} \le 5) = 0.6061
\end{aligned}$$
> pchisq(12, 10) - pchisq(5, 10)
[1] 0.6061215
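The chi-square answer above can also be checked by simulation. The following is a minimal sketch, not part of the original solution: the seed, the value of sigma, and the number of simulations are all assumptions chosen for illustration (the ratio $S^2/\sigma^2$ does not depend on sigma).

```r
# Monte Carlo check of P(0.5 < S^2/sigma^2 < 1.2) for normal samples of size 11
set.seed(13)          # arbitrary seed (assumption), only for reproducibility
n     <- 11
sigma <- 3            # hypothetical value; the ratio S^2/sigma^2 is scale free
sims  <- 20000        # hypothetical number of simulated samples

# One value of S^2 / sigma^2 per simulated sample
ratios <- replicate(sims, var(rnorm(n, mean = 0, sd = sigma)) / sigma^2)

sim_prob   <- mean(ratios > 0.5 & ratios < 1.2)     # empirical proportion
exact_prob <- pchisq(12, n - 1) - pchisq(5, n - 1)  # chi-square answer from the solution
c(simulated = sim_prob, exact = exact_prob)
```

The two numbers should agree to roughly two decimal places with this many simulations.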
14. The vendor in charge of servicing coffee dispensers is adjusting the one located in the department of statistics. To maximize profit, adjustments are made so that the average quantity of liquid dispensed per serving is 200 milliliters per cup. Suppose the amount of liquid per cup follows a normal distribution and 5.5% of the cups contain more than 224 milliliters.
Chapter 6:
Sampling and Sampling Distributions
(a) Find the probability that a given cup contains between 176 and 224 milliliters. (b) If the machine can hold 20 liters of liquid, find the probability that the machine must be replenished before dispensing 99 cups. (c) If 6 random samples of 5 cups are drawn, what is the probability that the sample mean is greater than 210 milliliters in at least 2 of them? Solution: Let X represent the amount of coffee dispensed in a cup. Then, X ∼ N (200, σ). Given that P(X > 224) = 0.055, σ = 15.017.
$$\begin{aligned}
P(X > 224) &= 0.055 \\
P\left(\frac{X - 200}{\sigma} > \frac{224 - 200}{\sigma}\right) &= 0.055 \\
P\left(Z > \frac{24}{\sigma}\right) &= 0.055 \\
1 - P\left(Z \le \frac{24}{\sigma}\right) &= 0.055 \\
P\left(Z \le \frac{24}{\sigma}\right) &= 0.945 \implies z_{0.945} = \frac{24}{\sigma} \\
\sigma &= \frac{24}{z_{0.945}} = \frac{24}{1.598} = 15.017
\end{aligned}$$
> sigma <- 24/qnorm(0.945)
> sigma
[1] 15.01696
(a) So, X ∼ N(200, 15.017). P(176 < X < 224) = P(X < 224) − P(X < 176) = 0.89
> pnorm(224, 200, sigma) - pnorm(176, 200, sigma) # P(176 < X < 224)
[1] 0.89
(b) Let Y represent the amount of liquid dispensed in 99 cups. That is, $Y = \sum_{i=1}^{99} X_i$ where $X_i \sim N(200, \sigma)$. It follows then that $Y \sim N\left(99 \cdot 200, \sqrt{99 \cdot \sigma^2}\right)$. P(Y ≥ 20000) = 1 − P(Y ≤ 20000) = 0.0904.
> 1 - pnorm(20000, 99*200, sqrt(99*sigma^2))
[1] 0.0903607
(c) Let W be the number of times the sample mean exceeds 210 out of 6 random samples each of size 5. The probability the sample mean exceeds 210 is $P(\bar{X} > 210) = 1 - P(\bar{X} \le 210) = 0.0682$ where $\bar{X} \sim N(200, \sigma/\sqrt{5})$. P(W ≥ 2) = 1 − P(W ≤ 1) = 0.0581.
> p <- 1 - pnorm(210, 200, sigma/sqrt(5))  # P(sample mean > 210)
> p
[1] 0.06823993
> 1 - pbinom(1, 6, p)
[1] 0.05808024
15. The pill weight for a particular type of vitamin follows a normal distribution with a mean of 0.6 grams and a standard deviation of 0.015 grams. It is known that a particular therapy consisting of a box of vitamins with 125 pills is not effective if more than 20% of the pills are under 0.58 grams. (a) Find the probability that the therapy with a box of vitamins is not effective. (b) A supplement manufacturer sells vitamin bottles containing 125 vitamins per bottle with 50 bottles per box with a guarantee that at least 47 bottles per box weigh more than 74.7 grams. Find the probability that a randomly chosen box does not meet the guaranteed weight.
Solution: (a) Let X = weight of a particular vitamin. X ∼ N(0.6, 0.015). The therapy is not effective if more than 0.20 × 125 = 25 pills are under 0.58 grams. The probability of a vitamin being underweight is P(X ≤ 0.58) = p =⇒ p = 0.0912. Let W = number of underweight pills. W ∼ Bin(125, p). The probability the therapy is not effective is $P(W > 25) = 1 - P(W \le 25) = 5.515 \times 10^{-5}$.
> p <- pnorm(0.58, 0.6, 0.015)
> p
[1] 0.09121122
> 1 - pbinom(25, 125, p)
[1] 5.515032e-05
(b) Let Y = weight of a bottle of vitamins. $Y \sim N\left(125 \times 0.6, \sqrt{125 \times 0.015^2}\right)$. That is, Y ∼ N(75, 0.1677). P(Y > 74.7) = 1 − P(Y ≤ 74.7) = 0.9632.
> ans <- 1 - pnorm(74.7, 125*0.6, sqrt(125*0.015^2))  # P(Y > 74.7)
> ans
[1] 0.9631809
Let V = number of bottles that weigh in excess of 74.7 grams. V ∼ Bin(50, 0.9632) and P(V ≤ 46) = 0.1117. In other words, the probability a randomly selected box does not meet the manufacturer's guarantee is 0.1117.
> pbinom(46, 50, ans)
# P(V > > > > > > + + + > + + + +
set.seed(78) sims + + +
sims + + + > + > + + +
xbar100 + + > + + + > + >
xbar300 + + > + + + > + > + + + >
xbar500 0.45) = P(t33 > −1.55115) = 0.9348
> tobs tobs [1] -1.55115 > 1 - pt(tobs, 33) [1] 0.9347978
19. Plot the density function of an F4,6 random variable. Find the area to the left of x = 3 and shade this region in the original plot. Solution: curve(df(x, 4, 6), from = 0, to = 9, lwd = 2, col = "red", ylab = "",xlab = "") x > > > + > >
[Figure: density of an F4,6 random variable with the region to the left of x = 3 shaded; P(F4,6 < 3) = 0.8888889]
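Most of the base-graphics code for this plot did not survive in this copy. The following is a minimal sketch of one way to draw and shade the region, assuming a `polygon()`-based approach and a pink fill; it is not necessarily the authors' code.

```r
# Area to the left of x = 3 under the F(4, 6) density
area <- pf(3, 4, 6)

# Density curve, matching the first line of the solution
curve(df(x, 4, 6), from = 0, to = 9, lwd = 2, col = "red", ylab = "", xlab = "")

# Shade 0 <= x <= 3 with a polygon that follows the density
xs <- seq(0, 3, length.out = 200)
polygon(c(0, xs, 3), c(0, df(xs, 4, 6), 0), col = "pink", border = NA)

# Redraw the curve on top of the shading and report the area in the title
curve(df(x, 4, 6), from = 0, to = 9, lwd = 2, col = "red", add = TRUE)
title(main = paste("P(F4,6 < 3) =", round(area, 7)))
```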
Similar graph with ggplot:
Chapter 6: > + + + + + + > > > + + + + + +
limitRange 0) = 1 − P(Y ≤ 0) = 0.0019. > > > > >
P1 1.54) P2 1.49) PI 1.54| X > 1.49) Ans 149.8) = 0.9772. > 1 - pnorm(149.8, 150, 0.1)
# P(Z > 149.8)
[1] 0.9772499
23. Consider a random sample of size n from an exponential distribution with parameter λ. Use moment generating functions to show that the sample mean follows a Γ(n, λn). Graph the theoretical sampling distribution of $\bar{X}$ when sampling from an Exp(λ = 1) for n = 30, 100, 300, and 500. Superimpose an appropriate normal density for each Γ(n, λn). At what sample size do the sampling distribution and superimposed density virtually coincide?
Solution: For $X \sim Exp(\lambda)$, $M_X(t) = \left(1 - \frac{t}{\lambda}\right)^{-1}$. Also, the moment generating function of a $Y \sim \Gamma(n, \lambda n)$ is $M_Y(t) = \left(1 - \frac{t}{\lambda n}\right)^{-n}$.
Since $\bar{X} = \sum_{i=1}^{n} \frac{X_i}{n}$,
$$M_{\bar{X}}(t) = E\left[e^{t\bar{X}}\right] = E\left[e^{t\sum_{i=1}^{n} X_i/n}\right] = E\left[\prod_{i=1}^{n} e^{t \cdot X_i/n}\right].$$
Because the $X_i$s are independent and identically distributed,
$$M_{\bar{X}}(t) = \prod_{i=1}^{n} E\left[e^{\frac{t}{n} X_i}\right] = \prod_{i=1}^{n} M_{X_i}\left(\frac{t}{n}\right) = \prod_{i=1}^{n} \left(1 - \frac{t}{n\lambda}\right)^{-1} = \left(1 - \frac{t}{n\lambda}\right)^{-n} = M_Y(t).$$
Note that the sampling distributions of the sample mean are Γ(30, 30), Γ(100, 100), Γ(300, 300), and Γ(500, 500) for the sample sizes n = 30, 100, 300, and 500, respectively. The normal distributions superimposed over the gamma distributions are $N(1, \sqrt{1/30})$, $N(1, \sqrt{1/100})$, $N(1, \sqrt{1/300})$, and $N(1, \sqrt{1/500})$, respectively, since the mean of the gamma is $\alpha/\lambda$ and the variance is $\alpha/\lambda^2$.
Code for the graph with Γ(30, 30):
> curve(dgamma(x, 30, 30), from = 1 - 3.5*sqrt(1/30),
+       to = 1 + 3.5*sqrt(1/30), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/30)), from = 1 - 3.5*sqrt(1/30),
+       to = 1 + 3.5*sqrt(1/30), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(30, 30))),
+        expression(N(list(1, sqrt(1/30))))),
+        text.col=c("blue", "red"), bg = "gray92", cex = 0.90)
[Figure: Γ(30, 30) density (solid blue) with the N(1, √(1/30)) density (dashed red) superimposed]
Code for the graph with Γ(100, 100):
> curve(dgamma(x, 100, 100), from = 1 - 3.5*sqrt(1/100),
+       to = 1 + 3.5*sqrt(1/100), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/100)), from = 1 - 3.5*sqrt(1/100),
+       to = 1 + 3.5*sqrt(1/100), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(100, 100))),
+        expression(N(list(1, sqrt(1/100))))),
+        text.col=c("blue", "red"), bg = "gray92", cex = 0.90)
[Figure: Γ(100, 100) density (solid blue) with the N(1, √(1/100)) density (dashed red) superimposed]
Code for the graph with Γ(300, 300):
> curve(dgamma(x, 300, 300), from = 1 - 3.5*sqrt(1/300),
+       to = 1 + 3.5*sqrt(1/300), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/300)), from = 1 - 3.5*sqrt(1/300),
+       to = 1 + 3.5*sqrt(1/300), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(300, 300))),
+        expression(N(list(1, sqrt(1/300))))),
+        text.col=c("blue", "red"), bg = "gray92", cex = 0.90)
[Figure: Γ(300, 300) density (solid blue) with the N(1, √(1/300)) density (dashed red) superimposed]
Code for the graph with Γ(500, 500):
> curve(dgamma(x, 500, 500), from = 1 - 3.5*sqrt(1/500),
+       to = 1 + 3.5*sqrt(1/500), ylab = "", lwd = 2, col = "blue", xlab = "")
> curve(dnorm(x, 1, sqrt(1/500)), from = 1 - 3.5*sqrt(1/500),
+       to = 1 + 3.5*sqrt(1/500), ylab = "", lwd = 2, lty = 2, col = "red",
+       add = TRUE, xlab = "")
> legend(x = "topright", legend = c(expression(Gamma(list(500, 500))),
+        expression(N(list(1, sqrt(1/500))))),
+        text.col=c("blue", "red"), bg = "gray92", cex = 0.90)
[Figure: Γ(500, 500) density (solid blue) with the N(1, √(1/500)) density (dashed red) superimposed]
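The four graphs answer the "virtually coincide" question visually. As a complement, the sketch below measures the largest vertical gap between each Γ(n, n) density and its normal approximation, expressed relative to the peak height of the normal curve. The grid size and the choice of a relative (rather than absolute) gap are assumptions made here for illustration, not part of the original solution.

```r
# Largest gap between the Gamma(n, n) density and the N(1, sqrt(1/n)) density,
# measured over the same plotting window used above (1 +/- 3.5 standard deviations)
n_values <- c(30, 100, 300, 500)

rel_gap <- sapply(n_values, function(n) {
  s <- sqrt(1 / n)                                   # standard deviation of the sample mean
  x <- seq(1 - 3.5 * s, 1 + 3.5 * s, length.out = 2001)
  gap <- max(abs(dgamma(x, shape = n, rate = n) - dnorm(x, mean = 1, sd = s)))
  gap / dnorm(1, mean = 1, sd = s)                   # scale by the peak of the normal density
})

data.frame(n = n_values, relative_gap = rel_gap)
```

A relative gap is used because both densities grow taller as n increases, so an absolute gap would not shrink even though the curves look increasingly alike.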
24. Set the seed equal to 10, and simulate 20,000 random samples of size $n_x = 65$ from a $N(4, \sigma_x = \sqrt{2})$, 20,000 random samples of size $n_y = 90$ from a $N(5, \sigma_y = \sqrt{3})$, and verify that the simulated statistic
$$\frac{S_x^2/\sigma_x^2}{S_y^2/\sigma_y^2}$$
follows an $F_{64,89}$ distribution.
Solution: There is close agreement between the pink simulated distribution and the blue theoretical F64,89 distribution.
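The simulation code for this exercise is largely lost in this copy. The following is a minimal sketch of the kind of simulation the exercise describes; the object names and plotting choices are assumptions, so the simulated values need not match the authors' output exactly.

```r
set.seed(10)
sims <- 20000
nx <- 65; ny <- 90
sigx <- sqrt(2); sigy <- sqrt(3)

# One value of (Sx^2/sigx^2) / (Sy^2/sigy^2) per pair of simulated samples
ratio <- replicate(sims, {
  vx <- var(rnorm(nx, mean = 4, sd = sigx))
  vy <- var(rnorm(ny, mean = 5, sd = sigy))
  (vx / sigx^2) / (vy / sigy^2)
})

# Density histogram of the simulated statistic with the F(64, 89) density on top
hist(ratio, freq = FALSE, breaks = "Scott", col = "pink", main = "", xlab = "")
curve(df(x, nx - 1, ny - 1), add = TRUE, col = "blue", lwd = 2)

# Numerical comparison of a few simulated and theoretical quantiles
probs <- c(0.05, 0.50, 0.95)
rbind(simulated = quantile(ratio, probs), theoretical = qf(probs, nx - 1, ny - 1))
```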
> > > > + + + > + + + + +
set.seed(10) sims > >
set.seed(95) sims > > > > >
set.seed(368) sims > > >
set.seed(48) sims > > > > > > > >
set.seed(28) n 1.5). Solution: > > > > > >
set.seed(37) sims PEb = 1.5) > PEb [1] 0.2664411
> PTb PTb [1] 0.2648906 > PDb PDb [1] 0.585345
The empirical probability P(F < 2|F > 1.5) = 0.2664, while the theoretical probability P(F1,20 < 2|F1,20 > 1.5) = 0.2649. The percent difference between the empirical and theoretical answers is 0.5853%.
30. Verify empirically that
$$\frac{N(0, 1)}{\left(\frac{1}{5}\chi^2_5\right)^{1/2}} \sim t_5$$
by setting the seed equal to 36 and generating a sample of size 20,000 from a N(0, 1) distribution. Generate another sample of size 20,000 from a $\chi^2_5$ distribution. Perform the appropriate arithmetic to arrive at the simulated sampling distribution. Create a density histogram of the results and superimpose a theoretical $t_5$ density. Solution:
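The solution code is mostly lost in this copy. The following is a minimal sketch of the steps the exercise asks for; the object names, axis limits, and colors are assumptions.

```r
set.seed(36)
sims <- 20000
z    <- rnorm(sims)              # N(0, 1) sample
chi  <- rchisq(sims, df = 5)     # chi-square sample with 5 degrees of freedom
tsim <- z / sqrt(chi / 5)        # ratio that should follow a t distribution with 5 df

# Density histogram of the simulated values with the theoretical t5 density on top
hist(tsim, freq = FALSE, breaks = "Scott", xlim = c(-6, 6),
     col = "pink", main = "", xlab = "")
curve(dt(x, 5), add = TRUE, col = "blue", lwd = 2)

# Proportion of simulated values inside the central 95% region of a t5
coverage <- mean(abs(tsim) < qt(0.975, 5))
coverage
```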
> > > > > > + + + + +
set.seed(36) sims
$$P(C_1 \mid W > 1975) = \frac{P(W > 1975 \mid C_1)\,P(C_1)}{\sum_{i=1}^{3} P(W > 1975 \mid C_i)\,P(C_i)} = \frac{0.0022}{0.2631} = 0.0084$$
> num num [1] 0.002221388 > den den [1] 0.2630832 > PC1givnW PC1givnW [1] 0.008443672
32. 15.3% of the Spanish Internet domain names are ".org". If a sample of 2000 Spanish domain names is taken, (a) Calculate the exact probability that at least 300 domain names will be ".org". (b) Use a normal approximation to compute the approximate probability that at least 300 domain names will be ".org".
Solution: (a) Let X = number of Spanish Internet domain names that are ".org". Then X ∼ Bin(2000, 0.153) and P(X ≥ 300) = 1 − P(X ≤ 299) = 0.6546
> 1 - pbinom(299, 2000, 0.153)
[1] 0.6545762
(b) $X \sim N\left(2000(0.153), \sqrt{2000(0.153)(1 - 0.153)}\right)$ and P(X ≥ 300) = 1 − P(X ≤ 300) = 0.6453.
> 1 - pnorm(300, 2000*0.153, sqrt(2000*0.153*(1 - 0.153)))
[1] 0.6453108
33. Set the seed equal to 86, and simulate m1 = 20, 000 samples of size n1 = 1000 from a Bin(n1 , π = 0.3) and m2 = 20, 000 samples of size n2 = 1100 from a Bin(n2 , π = 0.7). Verify that the difference of sampling proportions follows a normal distribution.
Solution: Based on the graph, the shape of the sampling distribution of p1 −p2 is clearly approximately normal. Further, the mean and standard deviation of the simulated sampling distribution are -0.4001 and 0.02 which compare favorably with the theoretical answers of -0.4 and 0.02. > > > > > > > > > > +
set.seed(86) sims + + > + + + + + + + +
set.seed(679) sims + > + + > > > + + > > + +
p + + > > > + +
p lambda plot(lambda, eff(rep(50, length(lambda))), type = "l", + xlab = expression(lambda), ylab = expression(eff(T[2],T[1])), + ylim = c(0.995, 1.005)) > abline(h = 1, lty = "dashed")
[Figure: eff(T2, T1) versus λ for n = 50, with a dashed horizontal reference line at 1]
(b) T2 is more efficient than T1 regardless of λ for any positive n. For n = 50, eff[T2 , T1 ] = 1.0004. (c) Code to graph eff[T2 , T1 ] for various n: > n plot(n, eff(n), type = "l", xlab = expression(n), + ylab = expression(eff(T[2],T[1])), main = "Relative Efficiency", + col = "red") > abline(h = 1, lty = "dashed")
[Figure: "Relative Efficiency", eff(T2, T1) versus n, with a dashed horizontal reference line at 1]
(d) T2 is moderately more efficient than T1 for small sample sizes. (e) Regardless of the value of λ in a Γ(2, λ) distribution, the estimator T2 is more efficient than T1. The improvement in efficiency is only marginal and only for relatively small samples.
7. Consider a random variable X ∼ Exp(λ) and two estimators of $\frac{1}{\lambda}$, the expected value of X:
$$T_1 = \bar{X} \quad \text{and} \quad T_2 = \frac{\sum_{i=1}^{n} X_i + 1}{n + 2}.$$
(a) Derive an expression for the relative efficiency of T2 with respect to T1. (b) Plot eff(T2, T1) versus n values of 1, 2, 3, 4, 20, 25, 30. (c) Generalize your findings. Solution:
Chapter 7:
Point Estimation
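(a) For each estimator, the expected value, variance, bias, and mean squared error are determined as intermediate steps toward the relative efficiency.
For $T_1$:
$$E[T_1] = E[\bar{X}] = E\left[\frac{\sum_{i=1}^{n} X_i}{n}\right] = \frac{n\left(\frac{1}{\lambda}\right)}{n} = \frac{1}{\lambda}$$
$$Var[T_1] = Var[\bar{X}] = Var\left[\frac{\sum_{i=1}^{n} X_i}{n}\right] = \frac{n\left(\frac{1}{\lambda^2}\right)}{n^2} = \frac{1}{n\lambda^2}$$
$$Bias[T_1] = E[T_1] - \frac{1}{\lambda} = \frac{1}{\lambda} - \frac{1}{\lambda} = 0$$
$$MSE[T_1] = Var[T_1] + (Bias[T_1])^2 = \frac{1}{n\lambda^2} + 0 = \frac{1}{n\lambda^2}$$
For $T_2$:
$$E[T_2] = E\left[\frac{\sum_{i=1}^{n} X_i + 1}{n + 2}\right] = \frac{\sum_{i=1}^{n} E[X_i] + 1}{n + 2} = \frac{\frac{n}{\lambda} + 1}{n + 2}$$
$$Var[T_2] = Var\left[\frac{\sum_{i=1}^{n} X_i + 1}{n + 2}\right] = \frac{n\,Var[X]}{(n + 2)^2} = \frac{\frac{n}{\lambda^2}}{(n + 2)^2}$$
$$Bias[T_2] = E[T_2] - \frac{1}{\lambda} = \frac{\frac{n}{\lambda} + 1}{n + 2} - \frac{1}{\lambda} = \frac{\lambda - 2}{\lambda(n + 2)}$$
$$MSE[T_2] = Var[T_2] + (Bias[T_2])^2 = \frac{\frac{n}{\lambda^2}}{(n + 2)^2} + \left(\frac{\lambda - 2}{\lambda(n + 2)}\right)^2 = \frac{n + (\lambda - 2)^2}{\lambda^2 (n + 2)^2}$$
Consequently,
$$eff[T_2, T_1] = \frac{MSE[T_1]}{MSE[T_2]} = \frac{\frac{1}{n\lambda^2}}{\frac{n + (\lambda - 2)^2}{\lambda^2 (n + 2)^2}} = \frac{(n + 2)^2}{n(n + \lambda^2 - 4\lambda + 4)}$$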
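The closed-form expression above is easy to evaluate numerically. The following is a minimal sketch, assuming a helper `eff(n, lambda)` that takes both arguments (the authors' own `eff()` function is not recoverable from this copy) and a hypothetical set of λ values chosen for illustration.

```r
# Relative efficiency of T2 with respect to T1, from the final line of part (a)
eff <- function(n, lambda) {
  (n + 2)^2 / (n * (n + lambda^2 - 4 * lambda + 4))
}

# The two mean squared errors, used only to cross-check eff()
mse_t1 <- function(n, lambda) 1 / (n * lambda^2)
mse_t2 <- function(n, lambda) (n + (lambda - 2)^2) / (lambda^2 * (n + 2)^2)

n       <- c(1, 2, 3, 4, 20, 25, 30)   # sample sizes named in part (b)
lambdas <- c(0.5, 1, 2, 4)             # hypothetical rates, for illustration only

eff_table <- sapply(lambdas, function(l) eff(n, l))
dimnames(eff_table) <- list(paste0("n=", n), paste0("lambda=", lambdas))
round(eff_table, 3)
```

Values above 1 favor T2 and values below 1 favor T1, so a table like this gives a quick view of how the comparison depends on both n and λ.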
(b) The code to plot eff(T2 , T1 ) versus n values of 1, 2, 3, 4, 20, 25, and 30 is
> + + > > > + + > > + + > > + + >
eff >
xi
π xi (1 − π)120−xi .
parcels > >
X
−n+
√
−n ±
n2 +4n 2n
n n2 + 4n i=1 x2i 2n
n
i=1
Xi2
.
xs >
loglike > > >
x > > > > >
set.seed(11) n > >
set.seed(3) n x, X2 > x, X3 > x)
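$$\begin{aligned}
&= 1 - \prod_{i=1}^{3} P(X_i > x) \\
&= 1 - \prod_{i=1}^{3} \left(1 - F(x)\right) \\
&= 1 - \prod_{i=1}^{3} \left(1 - \left(1 - e^{-x/\theta}\right)\right) \\
&= 1 - \prod_{i=1}^{3} e^{-x/\theta} = 1 - e^{-3x/\theta}
\end{aligned}$$
$\implies f(x) = \dfrac{3}{\theta} e^{-3x/\theta}$, which is an exponential density with parameter $\dfrac{3}{\theta}$.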
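(b) For an estimator to be considered efficient, its variance must equal the CRLB. That is,
$$\begin{aligned}
Var\left[\hat{\theta}(X)\right] &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2\right]} \\
&= \frac{1}{n \cdot E\left[\left(\frac{\partial \ln\left(\frac{1}{\theta} e^{-X/\theta}\right)}{\partial\theta}\right)^2\right]} \\
&= \frac{1}{n \cdot E\left[\left(\frac{\partial}{\partial\theta}\left(-\ln(\theta) - \frac{X}{\theta}\right)\right)^2\right]} \\
&= \frac{1}{n \cdot E\left[\left(-\frac{1}{\theta} + \frac{X}{\theta^2}\right)^2\right]} \\
&= \frac{\theta^4}{n \cdot E\left[(X - \theta)^2\right]} \\
&= \frac{\theta^4}{n \cdot Var[X]} \\
&= \frac{\theta^4}{n \cdot \theta^2} = \frac{\theta^2}{n}
\end{aligned}$$
Since $Var\left[\hat{\theta}_3(X)\right] = \frac{\theta^2}{n}$, it is efficient.
(c) $\hat{\theta}_3(X)$ is the MLE because it is efficient.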
(d) If X ∼ Exp(θ + 2), E[X] = θ + 2. To create an unbiased estimator of θ, a statistic with expected value θ + 2 can be shifted down by 2: any of $\hat{\theta}_1$, $\hat{\theta}_2$, or $\hat{\theta}_3$ minus 2 will yield an unbiased estimator of θ.
20. Consider a random sample of size n from a population of size N, where the items in the population are sequentially numbered from 1 to N. (a) Derive the method of moments estimator of N. (b) Derive the maximum likelihood estimator of N. (c) What are the method of moments and maximum likelihood estimates of N for this sample of size 7: {2, 5, 13, 6, 15, 9, 21}?
Solution:
(a) To derive the method of moments estimator, set $\alpha_1 = E[X^1] \overset{\text{set}}{=} \bar{X} = m_1$.
$$E[X] = \sum_{i=1}^{N} x_i \cdot \frac{1}{N} = \frac{N(N + 1)}{2} \cdot \frac{1}{N} = \frac{N + 1}{2}$$
$$\frac{N + 1}{2} = \bar{X} \implies N + 1 = 2\bar{X} \implies N = 2\bar{X} - 1 \implies \tilde{N} = 2\bar{X} - 1$$
(b)
$$P(X = x) = \begin{cases} \frac{1}{N} & \text{if } x = 1, \ldots, N \\ 0 & \text{otherwise} \end{cases}$$
This means the likelihood function is
$$L(N|\mathbf{x}) = \begin{cases} \frac{1}{N^n} & N \ge \max\{x_1, \ldots, x_n\} \\ 0 & \text{otherwise} \end{cases}$$
Since this function decreases as N increases, the smallest value of N which produces a non-zero value is $\hat{N}(\mathbf{x}) = \max\{x_1, \ldots, x_n\}$.
(c) > > > >
x
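The code for part (c) is lost in this copy. The following is a minimal sketch that applies the two estimators derived in parts (a) and (b) to the sample given in the exercise; the object names are assumptions.

```r
# Sample of size 7 from the exercise
x <- c(2, 5, 13, 6, 15, 9, 21)

N_mom <- 2 * mean(x) - 1   # method of moments estimate: 2 * xbar - 1
N_mle <- max(x)            # maximum likelihood estimate: largest observation

c(MOM = N_mom, MLE = N_mle)
```

For this sample the method of moments estimate falls below the largest observed value, which is one reason the maximum likelihood estimate is often preferred in this setting.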
lifetimes mle mle [1] 0.199561
ˆ For this sample, λ(x) = 0.1996. (c)
> + + + > >
loglike LBWT LBWT low 0 130
1 59
> prop.table(LBWT)
low
        0         1
0.6878307 0.3121693
(c) The maximum likelihood estimator of π for a Bernoulli distribution is $\bar{X}$. Consequently, the MLE is $\bar{x} = 0.3122$. (d) The maximum likelihood estimate of the proportion of children born with low birth weight is approximately 31% for this particular hospital.
23. In 1876, Charles Darwin had his book The Effect of Cross- and Self-Fertilization in the Vegetable Kingdom published. Darwin planted two seeds, one obtained by cross-fertilization and the other by auto-fertilization, in two opposite but separate locations of a pot. Self-fertilization, also called autogamy or selfing, is the fertilization of a plant with its own pollen. Cross-fertilization, or allogamy, is the fertilization with pollen of another plant, usually of the same species. Darwin recorded the plants' heights in inches. The data frame FERTILIZE from the PASWR2 package contains the data from this experiment.
Cross-fert  23.5  12.0  21.0  22.0  19.1  21.5  22.1  20.4  18.3  21.6  23.3  21.0  22.1  23.0  12.0
Self-fert   17.4  20.4  20.0  20.0  18.4  18.6  18.6  15.3  16.5  18.0  16.3  18.0  12.8  15.5  18.0
(a) Create a variable DD defined as the difference between the variables cross and self. (b) Perform an exploratory analysis of DD to see if DD might follow a normal distribution. (c) Use the function fitdistr() found in the MASS package to obtain the maximum likelihood estimates of µ and σ if DD did follow a normal distribution. (d) Verify that the results from (c) are the sample mean and the uncorrected sample standard deviation of DD. Solution: (a)
> > + + >
TwoColumns shapiro.test(ThreeColumns$DD) Shapiro-Wilk normality test data: ThreeColumns$DD W = 0.90079, p-value = 0.09785 10
[Figure: density estimate of DD (left panel, x-axis DD, y-axis density) and normal quantile-quantile plot of DD (right panel, x-axis theoretical, y-axis sample)]
Based on the density estimate, the quantile-quantile plot, and the Shapiro-Wilk normality test (using an α level of 0.05) one might assume normality for the distribution of DD. (c) > library(MASS) > ans ans mean sd 2.6166667 4.5580667 (1.1768878) (0.8321853) > MU MU mean 2.616667 > SIGMA SIGMA
sd 4.558067
The maximum likelihood estimates of µ and σ using the function fitdistr() assuming a normal distribution are 2.6167 and 4.5581, respectively.
(d) The following code verifies that the numbers 2.6167 and 4.5581 are the sample mean and the uncorrected sample standard deviation of DD.
> with(data = ThreeColumns,
+      c(mean(DD), sqrt(var(DD)*(length(DD) - 1)/length(DD)) )
+ )
[1] 2.616667 4.558067
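24. The lognormal distribution has the following density function:
$$g(w) = \frac{1}{w\sigma\sqrt{2\pi}}\, e^{-\frac{(\ln w - \mu)^2}{2\sigma^2}}, \quad w \ge 0, \quad -\infty < \mu < \infty, \quad \sigma > 0$$
where ln(W) ∼ N(µ, σ). The mean and variance of W are, respectively,
$$E[W] = e^{\mu + \frac{\sigma^2}{2}} \quad \text{and} \quad Var[W] = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right).$$
Find the maximum likelihood estimators for E[W] and Var[W].
Solution: Given the MLEs for a normal distribution with mean µ and variance $\sigma^2$ are $\hat{\mu}(X) = \bar{X}$ and $\hat{\sigma}^2(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n} = S_u^2$, respectively (here with $X_i = \ln W_i$), and the fact that MLEs are invariant, it follows that the MLEs for the mean and variance of W are
$$\widehat{E[W]} = e^{\bar{X} + S_u^2/2} \quad \text{and} \quad \widehat{Var[W]} = e^{2\bar{X} + S_u^2}\left(e^{S_u^2} - 1\right).$$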
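The invariance result can be wrapped in a small helper. The following is a minimal sketch; the function name `lognormal_moments` is hypothetical, and the two inputs are the `meanlog` and `sdlog` estimates printed in the solution to Exercise 25, used here purely as illustrative values.

```r
# MLEs of E[W] and Var[W] for a lognormal W, obtained by plugging the MLEs of
# mu (meanlog) and sigma (sdlog, the uncorrected standard deviation of ln W)
# into the formulas for the lognormal mean and variance
lognormal_moments <- function(meanlog, sdlog) {
  s2 <- sdlog^2
  c(mean     = exp(meanlog + s2 / 2),
    variance = exp(2 * meanlog + s2) * (exp(s2) - 1))
}

# Illustrative inputs: the estimates reported for the brain variable in Exercise 25
est <- lognormal_moments(meanlog = 4.4254457, sdlog = 2.3560485)
est
```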
25. Consider the variable brain from the Animals data frame in the MASS package. (a) Estimate with maximum likelihood techniques the mean and variance of brain. Specifically, use the R function fitdistr() with a lognormal distribution. (b) Suppose that brain is a lognormal variable; then the log of this variable is normal. To check this assertion, plot the cumulative distribution function of brain versus a lognormal cumulative distribution function. In another plot, represent the cumulative distribution function of log-brain versus a normal cumulative distribution function. Is it reasonable to assume that brain follows a lognormal distribution? (c) Find the mean and standard deviation of brain assuming a lognormal distribution. (d) Repeat this exercise without the dinosaurs. Comment on the changes in the mean and variance estimates.
Solution: (a) The MLEs for the mean and variance of W are
$$\widehat{E[W]} = e^{\bar{X} + S_u^2/2} \quad \text{and} \quad \widehat{Var[W]} = e^{2\bar{X} + S_u^2}\left(e^{S_u^2} - 1\right).$$
> library(MASS) > mle mle meanlog sdlog 4.4254457 2.3560485 (0.4452513) (0.3148402) > EW VW c(EW, VW) meanlog meanlog 1.340674e+03 4.610096e+08
The maximum likelihood estimates of the mean and standard deviation for the logarithm of the variable brain are 4.4254 and 2.356, respectively. It follows using invariance properties of MLEs that the estimates of the mean and variance of brain are 1340.6744 g and 461009623.032 g², respectively (brain weights in the Animals data frame are recorded in grams). (b)
> ggplot(data = Animals, aes(x = brain)) + stat_ecdf() +
+   stat_function(fun = plnorm,
+                 args = list(mle$estimate[1], mle$estimate[2])) +
+   theme_bw()
> #
> ggplot(data = Animals, aes(x = log(brain))) + stat_ecdf() +
+   stat_function(fun = pnorm,
+                 args = list(mle$estimate[1], mle$estimate[2])) +
+   theme_bw()
[Figure: empirical CDF of brain with the fitted lognormal CDF (left panel) and empirical CDF of log(brain) with the fitted normal CDF (right panel)]
It seems reasonable to assume brain follows a lognormal distribution based on the graphs.
(c) In agreement with part (a), the estimated mean and variance of the variable brain are 1340.6744 g and 461009623.032 g², respectively.
> ans ans [1] 4.425446 5.756556 > xbar V c(xbar, V) [1] 4.425446 5.756556 > > > >
VU ggplot(data = NoDinos, aes(x = brain)) + stat_ecdf() + + stat_function(fun = plnorm, + args = list(mle$estimate[1], mle$estimate[2])) + + theme_bw() > # > ggplot(data = NoDinos, aes(x = log(brain))) + stat_ecdf() + + stat_function(fun = pnorm, + args = list(mle$estimate[1], mle$estimate[2])) + + theme_bw()
[Figure: empirical CDF of brain with the fitted lognormal CDF (left panel) and empirical CDF of log(brain) with the fitted normal CDF (right panel), dinosaurs removed]
It seems reasonable to assume brain still follows a lognormal distribution after removing the three dinosaurs based on the graphs. In agreement with R Code 7.3 on the preceding page, the estimated mean and variance of the variable brain are 1851.1266 g and 1668525179.2198 g², respectively.
> ans ans [1] 4.428471 6.448081 > xbar V c(xbar, V) [1] 4.428471 6.448081 > VU EW VW c(EW, VW) [1] 1.851127e+03 1.668525e+09
After removing the three dinosaurs, the estimates of both the mean and the variance of the brain weight have increased (the mean from 1340.6744 to 1851.1266, and the variance from 461009623.032 to 1668525179.2198).
26. The data in GD available in the PASWR2 package are the times until failure in hours for a particular electronic component subjected to an accelerated stress test.
(a) Find the method of moments estimates of α and λ if the data come from a Γ(α, λ) distribution.
(b) Create a density histogram of times until failure. Superimpose a gamma distribution using the estimates from part (a) over the density histogram.
(c) Find the maximum likelihood estimates of α and λ if the data come from a Γ(α, λ) distribution by using the function fitdistr() from the MASS package.
(d) Create a density histogram of times until failure. Superimpose a gamma distribution using the estimates from part (c) over the density histogram.
(e) Plot the cumulative distribution for time until failure. Superimpose the theoretical cumulative gamma distribution using both the method of moments and the maximum likelihood estimates of α and λ. Which estimates appear to model the data better?
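Solution: (a) When X ∼ Γ(α, λ), $E[X] = \frac{\alpha}{\lambda}$ and $Var[X] = \frac{\alpha}{\lambda^2}$. To find the MOM estimates, the system of equations to be solved is
$$\alpha_1(\alpha, \lambda) = E[X] = \frac{\alpha}{\lambda} \overset{\text{set}}{=} \frac{\sum_{i=1}^{n} X_i}{n} = m_1$$
$$\alpha_2(\alpha, \lambda) = E[X^2] = Var[X] + E[X]^2 = \frac{\alpha}{\lambda^2} + \left(\frac{\alpha}{\lambda}\right)^2 \overset{\text{set}}{=} \frac{\sum_{i=1}^{n} X_i^2}{n} = m_2$$
From the first equation, $\alpha = \lambda \frac{\sum_{i=1}^{n} X_i}{n}$. Substituting into the second equation to solve for λ gives
$$\begin{aligned}
\frac{\alpha}{\lambda^2} + \frac{\alpha^2}{\lambda^2} &= \frac{\sum_{i=1}^{n} X_i^2}{n} \\
\alpha + \alpha^2 &= \frac{\lambda^2 \sum_{i=1}^{n} X_i^2}{n} \\
\frac{\lambda \sum_{i=1}^{n} X_i}{n} + \left(\frac{\lambda \sum_{i=1}^{n} X_i}{n}\right)^2 &= \frac{\lambda^2 \sum_{i=1}^{n} X_i^2}{n} \\
\frac{\sum_{i=1}^{n} X_i}{n} &= \lambda\left(\frac{\sum_{i=1}^{n} X_i^2}{n} - \left(\frac{\sum_{i=1}^{n} X_i}{n}\right)^2\right) \\
\bar{X} &= \lambda S_u^2 \\
\lambda &= \frac{\bar{X}}{S_u^2} \implies \tilde{\lambda} = \frac{\bar{X}}{S_u^2}
\end{aligned}$$
Solving for α now gives
$$\alpha = \lambda \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\bar{X}}{S_u^2} \cdot \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\bar{X}^2}{S_u^2} \implies \tilde{\alpha} = \frac{\bar{X}^2}{S_u^2}$$
> estimates estimates [1] 10.52530 11.68381 > Alpha Lambda c(Alpha, Lambda) [1] 9.4816647 0.9008451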
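The assignments that produced `estimates`, `Alpha`, and `Lambda` are lost in this copy. The following is a minimal sketch of the two method of moments formulas derived above; the helper `gamma_mom()` and the simulated data are assumptions used only for illustration, while the two summary values are the ones printed in the output.

```r
# Method of moments estimates for a Gamma(alpha, lambda) sample:
# alpha = xbar^2 / Su^2 and lambda = xbar / Su^2
gamma_mom <- function(x) {
  xbar <- mean(x)
  su2  <- mean((x - xbar)^2)          # uncorrected variance Su^2 (divisor n)
  c(alpha = xbar^2 / su2, lambda = xbar / su2)
}

# The same formulas applied to the two summary values printed in the output
xbar <- 10.52530
su2  <- 11.68381
from_summary <- c(alpha = xbar^2 / su2, lambda = xbar / su2)
from_summary

# Hypothetical simulated data, used only to exercise gamma_mom()
set.seed(26)
x_sim   <- rgamma(5000, shape = 9.5, rate = 0.9)
est_sim <- gamma_mom(x_sim)
est_sim
```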
$$\tilde{\alpha} = \frac{\bar{X}^2}{S_u^2} = \frac{10.5253^2}{11.6838} = 9.4817 \qquad \tilde{\lambda} = \frac{\bar{X}}{S_u^2} = \frac{10.5253}{11.6838} = 0.9008$$
(b)
> ggplot(data = GD, aes(x = attf)) +
+   geom_histogram(aes(x = attf, y = ..density..), binwidth = 1.5,
+                  fill = "blue", alpha = 0.3) +
+   stat_function(fun = dgamma, args = list(Alpha, Lambda),
+                 color = "blue") +
+   theme_bw()
[Figure: density histogram of attf with the gamma density based on the method of moments estimates superimposed (x-axis attf, y-axis density)]
(c)
> library(MASS)
> mle mle
      shape        rate
  9.3961672   0.8918924
 (1.3058899) (0.1273251)
The maximum likelihood estimates of α and λ are 9.3962 and 0.8919, respectively.
(d)
> ggplot(data = GD, aes(x = attf)) +
+   geom_histogram(aes(x = attf, y = ..density..), binwidth = 1.5,
+                  fill = "red", alpha = 0.3) +
+   stat_function(fun = dgamma,
+                 args = list(mle$estimate[1], mle$estimate[2]),
+                 color = "red") +
+   theme_bw()
[Figure: density histogram of attf with the gamma density based on the maximum likelihood estimates superimposed (x-axis attf, y-axis density)]
(e)
> ggplot(data = GD, aes(x = attf)) +
+   stat_ecdf() +
+   stat_function(fun = pgamma, args = list(Alpha, Lambda),
+                 color = "blue") +
+   stat_function(fun = pgamma,
+                 args = list(mle$estimate[1], mle$estimate[2]),
+                 color = "red") +
+   theme_bw()
[Figure: empirical CDF of attf with the theoretical gamma CDFs based on the method of moments estimates (blue) and the maximum likelihood estimates (red) superimposed]
The cumulative distribution function using the method of moments estimates for α and λ is visually indistinguishable from the cumulative distribution function using the maximum likelihood estimates for α and λ.
27. The time a client waits to be served by the mortgage specialist at a bank has density function
$$f(x) = \frac{1}{2\theta^3} x^2 e^{-x/\theta}, \quad x > 0, \ \theta > 0.$$
(a) Derive the maximum likelihood estimator of θ for a random sample of size n.
(b) Show that the estimator derived in (a) is unbiased and efficient.
(c) Derive the method of moments estimator of θ.
(d) If the waiting times of 15 clients are 6, 12, 15, 14, 12, 10, 8, 9, 10, 9, 8, 7, 10, 7, and 3 minutes, compute the maximum likelihood estimate of θ.
Solution:
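(a)
$$\begin{aligned}
L(\theta|\mathbf{x}) &= \prod_{i=1}^{n} \frac{1}{2\theta^3} x_i^2 e^{-x_i/\theta} = \frac{1}{2^n \theta^{3n}} \left(\prod_{i=1}^{n} x_i^2\right) e^{-\sum_{i=1}^{n} x_i/\theta} \\
\ln L(\theta|\mathbf{x}) &= -n\ln 2 - 3n\ln\theta + 2\sum_{i=1}^{n} \ln x_i - \frac{\sum_{i=1}^{n} x_i}{\theta} \\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= -\frac{3n}{\theta} + \frac{\sum_{i=1}^{n} x_i}{\theta^2} \overset{\text{set}}{=} 0 \\
\implies -3n\theta &= -\sum_{i=1}^{n} x_i \\
\theta &= \frac{\sum_{i=1}^{n} x_i}{3n} \implies \hat{\theta}(X) = \frac{\bar{X}}{3}
\end{aligned}$$
To verify this is a maximum value, take the second partial and show it is less than zero.
$$\left.\frac{\partial^2 \ln L(\theta|\mathbf{x})}{\partial\theta^2}\right|_{\hat{\theta} = \bar{x}/3} = \frac{3n}{\theta^2} - \frac{2\sum_{i=1}^{n} x_i}{\theta^3} = \frac{27n}{\bar{x}^2} - \frac{2n\bar{x}}{\bar{x}^3/27} = \frac{27n}{\bar{x}^2} - \frac{54n}{\bar{x}^2} < 0 \text{ for all } n.$$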
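(b) Show that $\hat{\theta}(X) = \frac{\bar{X}}{3}$ is unbiased and efficient.
Unbiased:
$$E[X] = \int_0^{\infty} x \cdot \frac{1}{2\theta^3} x^2 e^{-x/\theta}\,dx$$
Let $u = \frac{x}{\theta}$, and note that $du = \frac{dx}{\theta}$:
$$E[X] = \int_0^{\infty} \frac{u^3}{2} e^{-u}\,\theta\,du = \frac{\theta}{2}\Gamma(4) = \frac{\theta}{2}\,3! = 3\theta$$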
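$$E\left[\hat{\theta}(X)\right] = E\left[\frac{\bar{X}}{3}\right] = \frac{\sum_{i=1}^{n} E[X_i]}{3n} = \frac{n \cdot 3\theta}{3n} = \theta$$
Therefore, $\hat{\theta}(X) = \frac{\bar{X}}{3}$ is an unbiased estimator for θ. For later calculations, $E[X^2]$ is also needed.
$$E[X^2] = \int_0^{\infty} x^2 \cdot \frac{1}{2\theta^3} x^2 e^{-x/\theta}\,dx$$
Let $u = \frac{x}{\theta}$, and note that $du = \frac{dx}{\theta}$:
$$E[X^2] = \theta \int_0^{\infty} \frac{u^4}{2} e^{-u}\,\theta\,du = \frac{\theta^2}{2}\Gamma(5) = \frac{\theta^2}{2}\,4! = 12\theta^2$$
$$Var[X] = E[X^2] - (E[X])^2 = 12\theta^2 - (3\theta)^2 = 3\theta^2$$
For an estimator to be considered efficient, its variance must equal the CRLB. That is,
$$\begin{aligned}
Var\left[\hat{\theta}(X)\right] = Var\left[\frac{\bar{X}}{3}\right] &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial \ln f(X|\theta)}{\partial\theta}\right)^2\right]} \\
\frac{1}{9} \cdot Var\left[\frac{\sum_{i=1}^{n} X_i}{n}\right] &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial \ln\left(\frac{1}{2\theta^3} X^2 e^{-X/\theta}\right)}{\partial\theta}\right)^2\right]} \\
\frac{1}{9n^2} \cdot Var\left[\sum_{i=1}^{n} X_i\right] &\overset{?}{=} \frac{1}{n \cdot E\left[\left(\frac{\partial}{\partial\theta}\left(-\ln 2 - 3\ln\theta + 2\ln X - \frac{X}{\theta}\right)\right)^2\right]} \\
\frac{1}{9n^2} \cdot \sum_{i=1}^{n} Var[X_i] &\overset{?}{=} \frac{1}{n \cdot E\left[\left(-\frac{3}{\theta} + \frac{X}{\theta^2}\right)^2\right]} = \frac{\theta^4}{n \cdot E\left[(X - 3\theta)^2\right]} \\
\frac{1}{9n^2} \cdot n \cdot 3\theta^2 &\overset{?}{=} \frac{\theta^4}{n \cdot Var[X]} \\
\frac{\theta^2}{3n} &= \frac{\theta^4}{n \cdot 3\theta^2} = \frac{\theta^2}{3n}
\end{aligned}$$
So, $\hat{\theta}(X) = \frac{\bar{X}}{3}$ is an efficient estimator of θ.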
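(c) To derive the method of moments estimator, solve
$$\alpha_1 = E[X] = 3\theta \overset{\text{set}}{=} \bar{X} = m_1 \implies \theta = \frac{\bar{X}}{3} \implies \tilde{\theta} = \frac{\bar{X}}{3}$$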
(d) > > > >
waiting + + > >
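The code for part (d) is lost in this copy. The following is a minimal sketch that computes the maximum likelihood estimate from the closed form $\bar{x}/3$ and confirms it by maximizing the log-likelihood numerically; the object names and the search interval passed to `optimize()` are assumptions.

```r
# Waiting times (minutes) of the 15 clients from the exercise
waiting <- c(6, 12, 15, 14, 12, 10, 8, 9, 10, 9, 8, 7, 10, 7, 3)

# Closed-form MLE derived in part (a): mean of the data divided by 3
theta_hat <- mean(waiting) / 3

# Log-likelihood for the density f(x) = x^2 exp(-x/theta) / (2 theta^3)
loglike <- function(theta) {
  sum(-log(2) - 3 * log(theta) + 2 * log(waiting) - waiting / theta)
}

# Numerical maximization over a hypothetical search interval
numeric_max <- optimize(loglike, interval = c(0.1, 20), maximum = TRUE)$maximum

c(closed_form = theta_hat, numeric = numeric_max)
```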
loglike 0.
(a) What distribution has this density function? Be sure to specify the parameter. (b) Find the maximum likelihood estimator of θ for random samples of size n. (c) Find the asymptotic variance of the maximum likelihood estimator. (d) Find the method of moments estimator of θ for a random sample of size n. (e) Calculate the maximum likelihood and method of moments estimates of θ for the sample {0.1, 0.7, 0.5, 0.85, 0.9}.
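Solution: (a) It is a β(α = θ, β = 1, A = 0, B = 1).
$$\begin{aligned}
f(x) &= \frac{1}{B - A} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \left(\frac{x - A}{B - A}\right)^{\alpha - 1} \left(\frac{B - x}{B - A}\right)^{\beta - 1} \\
f(x) &= \frac{1}{1 - 0} \cdot \frac{\Gamma(\theta + 1)}{\Gamma(\theta)\Gamma(1)} \cdot \left(\frac{x - 0}{1 - 0}\right)^{\theta - 1} \left(\frac{1 - x}{1 - 0}\right)^{1 - 1} \\
f(x) &= \frac{\theta\,\Gamma(\theta)}{\Gamma(\theta)\Gamma(1)} \cdot x^{\theta - 1} (1 - x)^0 \\
f(x) &= \theta x^{\theta - 1}
\end{aligned}$$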
(b) To find the MLE, set the partial of the log likelihood equal to zero, and solve for θ.
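$$\begin{aligned}
L(\theta|\mathbf{x}) &= \theta^n \prod_{i=1}^{n} x_i^{\theta - 1} \\
\ln L(\theta|\mathbf{x}) &= n\ln(\theta) + (\theta - 1)\sum_{i=1}^{n} \ln(x_i) \\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= \frac{n}{\theta} + \sum_{i=1}^{n} \ln(x_i) \overset{\text{set}}{=} 0 \\
\implies \theta &= \frac{-n}{\sum_{i=1}^{n} \ln(x_i)} \implies \hat{\theta}(X) = \frac{-n}{\sum_{i=1}^{n} \ln(X_i)}
\end{aligned}$$
(c) The lower bound on the variance is the reciprocal of the information, $I_n(\theta)^{-1}$.
$$I_n(\theta) = -E\left[\frac{\partial^2 \ln L(\theta|\mathbf{x})}{\partial\theta^2}\right] = -\left(\frac{-n}{\theta^2}\right) = \frac{n}{\theta^2} \implies I_n(\theta)^{-1} = \frac{\theta^2}{n}$$
(d) Finding the MOM estimator: Recall that the mean of a beta distribution is
$$A + (B - A)\frac{\alpha}{\alpha + \beta} = 0 + (1 - 0)\frac{\theta}{\theta + 1} = \frac{\theta}{\theta + 1} \qquad (7.1)$$
for this particular distribution.
$$\begin{aligned}
\alpha_1 = E[X] &\overset{\text{set}}{=} \bar{X} = m_1 \\
\frac{\theta}{\theta + 1} &= \bar{X} \\
\theta &= \theta\bar{X} + \bar{X} \\
\theta(1 - \bar{X}) &= \bar{X} \\
\theta &= \frac{\bar{X}}{1 - \bar{X}} \implies \tilde{\theta} = \frac{\bar{X}}{1 - \bar{X}}
\end{aligned}$$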
(e) Calculations: > > > >
x > >
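The calculations for part (e) are lost in this copy. The following is a minimal sketch that evaluates the two estimators derived in parts (b) and (d) on the sample from the exercise, with a numerical maximization of the log-likelihood as a cross-check; the object names and the search interval are assumptions.

```r
# Sample from the exercise
x <- c(0.1, 0.7, 0.5, 0.85, 0.9)
n <- length(x)

theta_mle <- -n / sum(log(x))          # maximum likelihood estimate: -n / sum(ln x_i)
theta_mom <- mean(x) / (1 - mean(x))   # method of moments estimate: xbar / (1 - xbar)

# Cross-check the MLE by maximizing ln L(theta) = n ln(theta) + (theta - 1) sum(ln x_i)
loglike   <- function(theta) n * log(theta) + (theta - 1) * sum(log(x))
theta_num <- optimize(loglike, interval = c(0.01, 20), maximum = TRUE)$maximum

c(MLE = theta_mle, MOM = theta_mom, numeric = theta_num)
```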
x f ggplot(data = data.frame(x = x), aes(x = x)) + + geom_histogram(aes(x = x, y = ..density..), fill = "red", + alpha = 0.5) + + theme_bw() + + stat_function(fun = f)
[Figure: density histogram of x with the density f superimposed (x-axis x, y-axis density)]
(d) To calculate the MLE of θ, take the partial of the log likelihood, set it equal to zero, and solve for θ.
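$$\begin{aligned}
L(\theta|\mathbf{x}) &= \prod_{i=1}^{n} 3\pi\theta x_i^2 e^{-\theta\pi x_i^3} \\
L(\theta|\mathbf{x}) &= (3\pi\theta)^n e^{-\theta\pi\sum_{i=1}^{n} x_i^3} \prod_{i=1}^{n} x_i^2 \\
\ln L(\theta|\mathbf{x}) &= n\ln(3\pi) + n\ln(\theta) - \theta\pi\sum_{i=1}^{n} x_i^3 + \sum_{i=1}^{n} 2\ln(x_i) \\
\frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} &= \frac{n}{\theta} - \pi\sum_{i=1}^{n} x_i^3 \overset{\text{set}}{=} 0 \\
\frac{n}{\theta} - \pi\sum_{i=1}^{n} x_i^3 &= 0 \\
n &= \theta\pi\sum_{i=1}^{n} x_i^3 \\
\implies \theta &= \frac{n}{\pi\sum_{i=1}^{n} x_i^3} \implies \hat{\theta}(X) = \frac{n}{\pi\sum_{i=1}^{n} X_i^3}
\end{aligned}$$
For the generated sample,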
> n mle mle [1] 5.038159
$$\hat{\theta}(X) = \frac{n}{\pi\sum_{i=1}^{n} X_i^3} = \frac{500}{(3.1416)(1263.5961)} = 5.0382.$$
(e)
> loglike > > > + + + >
set.seed(8675) n library(MASS) > fitdistr(x = x, densfun = "beta", start = list(shape1 = 1 , shape2 = 1)) Warning in densfun(x, parm[1], parm[2], ...): Warning in densfun(x, parm[1], parm[2], ...):
NaNs produced NaNs produced
shape1 shape2 3.0810430 2.1497909 (0.2422334) (0.1633135)
38. Consider a random sample of size n from an exponential distribution with pdf
$$f(x) = \frac{1}{\theta} e^{-x/\theta}, \quad x \ge 0, \ \theta > 0. \qquad (7.2)$$
(a) Find the MLE of θ. (b) Given the answer in part (a), what is the MLE of θ2 ? Solution:
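(a) To find the MLE of θ, set the partial derivative of the log-likelihood function equal to zero and solve for θ.
\[ f(x) = \frac{1}{\theta} e^{-x/\theta} \]
\[ L(\theta|\mathbf{x}) = \prod_{i=1}^{n} \frac{1}{\theta} e^{-x_i/\theta} = \frac{1}{\theta^n} e^{-\sum_{i=1}^{n} x_i/\theta} \]
\[ \ln L(\theta|\mathbf{x}) = -n\ln(\theta) - \frac{\sum_{i=1}^{n} x_i}{\theta} \]
\[ \frac{\partial \ln L(\theta|\mathbf{x})}{\partial\theta} = -\frac{n}{\theta} + \frac{\sum_{i=1}^{n} x_i}{\theta^2} \overset{\text{set}}{=} 0 \]
\[ -n\theta = -\sum_{i=1}^{n} x_i \]
\[ \theta = \frac{\sum_{i=1}^{n} x_i}{n} \implies \hat{\theta}(\mathbf{X}) = \frac{\sum_{i=1}^{n} X_i}{n} = \bar{X} \]
(b) Since $\bar{X}$ is the MLE of θ, by the invariance property of MLEs, the MLE of θ² is $\bar{X}^2$.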
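The closed-form result can be checked numerically. The following is a minimal sketch, not part of the original solution: the seed, the sample size, and the true value `theta_true` are hypothetical choices made only for illustration. It maximizes the exponential log-likelihood with `optimize()` and compares the maximizer with the sample mean.

```r
# Numerical check that the sample mean maximizes the exponential log-likelihood.
# The seed, sample size, and theta_true are illustrative assumptions.
set.seed(1)
theta_true <- 2
x <- rexp(200, rate = 1 / theta_true)   # exponential sample with mean theta_true

# Log-likelihood from part (a): -n*log(theta) - sum(x)/theta
loglik <- function(theta, x) -length(x) * log(theta) - sum(x) / theta

fit <- optimize(loglik, interval = c(0.01, 20), x = x,
                maximum = TRUE, tol = 1e-8)
theta_hat <- fit$maximum

c(numerical = theta_hat, closed_form = mean(x))

# Invariance property from part (b): the MLE of theta^2 is the squared sample mean
theta_sq_hat <- mean(x)^2
```

The numerical maximizer and `mean(x)` should agree up to the optimizer's tolerance.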
Chapter 8 Confidence Intervals
1. Is $[\bar{x} - 3, \bar{x} + 3]$ a confidence interval for the population mean of a normal distribution? Why or why not?
Solution: The interval $[\bar{x} - 3, \bar{x} + 3]$ is a confidence interval since it takes the form of a point estimate minus and plus a margin of error.
2. Explain how to construct a 95% confidence interval for the population mean of a normal distribution if σ is known.
Solution: Using the confidence interval for µ when sampling from a normal distribution with known variance, $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$, one would compute the sample mean and substitute that value for $\bar{x}$, 1.96 for $z_{1-\alpha/2}$, and the values of σ and n into the formula.
3. Given a random sample {X1, X2, . . . , Xn} from a normal population N(µ, σ), where σ is known:
(a) What is the confidence level for the interval $\bar{x} \pm 2.053749\,\sigma/\sqrt{n}$?
(b) What is the confidence level for the interval $\bar{x} \pm 1.405072\,\sigma/\sqrt{n}$?
(c) What is the value of the percentile $z_{\alpha/2}$ for a 99% confidence interval?
Solution:
(a) The confidence level is 0.96.
> 1 - pnorm(-2.053749)*2
[1] 0.96
(b) The confidence level is 0.84.
> 1 - pnorm(-1.405072)*2
[1] 0.8400001
(c) To answer the problem, one must find the value of $z_{0.005} = -2.5758$.
> qnorm(0.005)
[1] -2.575829
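The three calculations in Exercise 3 follow one pattern, which the sketch below wraps in two small helper functions. The names `conf_level()` and `z_mult()` are hypothetical and are not part of the original solution or of any package.

```r
# Confidence level implied by a z multiplier, and the multiplier for a given level.
# conf_level() and z_mult() are illustrative helper names.
conf_level <- function(z) 1 - 2 * pnorm(-abs(z))
z_mult <- function(level) qnorm(1 - (1 - level) / 2)

conf_level(2.053749)   # part (a)
conf_level(1.405072)   # part (b)
z_mult(0.99)           # magnitude of the percentile in part (c)
```

The two functions are inverses of each other, so `conf_level(z_mult(0.99))` returns 0.99 up to floating-point error.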
4. Given a random sample {X1, X2, . . . , Xn} from a normal population N(µ, σ), where σ is known, consider the confidence interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma/\sqrt{n}$ for µ.
(a) Given a fixed sample size n, explain the relationship between the confidence level and the precision of the confidence interval.
(b) Given a confidence level (1 − α) · 100%, explain how the precision of the confidence interval changes with the sample size.
Solution:
(a) As the confidence level increases, so does the value $z_{1-\alpha/2}$, which increases the width of the confidence interval and thereby decreases its precision. In a similar fashion, as the confidence level decreases, the confidence interval width decreases, increasing the precision of the confidence interval.
(b) With a fixed confidence level, as the sample size (n) increases, the margin of error decreases, producing a more precise confidence interval. Likewise, as the sample size (n) decreases, the margin of error increases, resulting in a less precise confidence interval.
5. Given a normal population with known variance σ², by what factor must the sample size be increased to reduce the length of a confidence interval for the mean by a factor of k?
Solution: The length of a confidence interval for the mean given a normal population with known variance σ² is $2 \cdot z_{1-\alpha/2}\,\sigma/\sqrt{n}$. Reducing the length by a factor of k implies that the ratio of the original length to the new length equals k.
\[ \frac{2 \cdot z_{1-\alpha/2}\,\sigma/\sqrt{n_{\text{original}}}}{2 \cdot z_{1-\alpha/2}\,\sigma/\sqrt{n_{\text{new}}}} = k \]
\[ \frac{\sqrt{n_{\text{new}}}}{\sqrt{n_{\text{original}}}} = k \implies n_{\text{new}} = k^2 \cdot n_{\text{original}} \]
Therefore, reducing the length of a confidence interval by a factor of k requires a sample size of $k^2 n_{\text{original}}$. For example, reducing the length of a confidence interval by a factor of 3 requires a sample size of 9n.
6. A historic data set studied by R.A. Fisher is the measurements in centimeters of four flower parts (sepal length, sepal width, petal length, and petal width) on 50 specimens for each of three species of irises (Setosa, Versicolor, and Virginica). The data are stored in the data frame iris (Fisher, 1936).
(a) Analyze the sepal lengths for Setosa, Versicolor, and Virginica irises, and comment on the characteristics of their distributions.
(b) Based on the analysis from part (a), construct an appropriate 99% confidence interval for the mean sepal length of Setosa irises.
Solution: (a) ggplot(data = iris, aes(x = Sepal.Length)) + geom_density(fill = "yellow", alpha = 0.5) + facet_grid(Species ~ .) + theme_bw() ggplot(data = iris, aes(sample = Sepal.Length)) + stat_qq() + facet_grid(Species ~ .) + theme_bw() means + + + > > >
[Figure: density plots of Sepal.Length and normal quantile-quantile plots (theoretical versus sample quantiles), each faceted by Species.]
The sepal lengths for Setosa irises are symmetric and unimodal centered at 5.006 cm with a standard deviation of 0.3525 cm. The sepal lengths for Versicolor irises are symmetric and unimodal centered at 5.936 cm with a standard deviation of 0.5162 cm. The sepal lengths for Virginica irises are symmetric and unimodal centered at 6.588 cm with a standard deviation of 0.6359 cm. (b) > CI CI [1] 4.872406 5.139594 attr(,"conf.level") [1] 0.99 A 99% confidence interval is [4.8724, 5.1396] for the mean sepal length of Setosa irises.
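The summary statistics and the interval quoted above can be reproduced from the built-in `iris` data. This is a sketch of one way to do so, assuming a one-sample t interval; the code that produced the original output did not survive in this copy, so the calls below are not necessarily the authors' own.

```r
# Means and standard deviations of sepal length by species (built-in iris data)
means <- tapply(iris$Sepal.Length, iris$Species, mean)
sds   <- tapply(iris$Sepal.Length, iris$Species, sd)
round(rbind(means, sds), 4)

# 99% t confidence interval for the mean sepal length of Setosa irises
setosa <- iris$Sepal.Length[iris$Species == "setosa"]
ci <- t.test(setosa, conf.level = 0.99)$conf.int
ci
```

A t interval is a reasonable choice here because the population standard deviation is unknown and the Setosa sepal lengths look approximately normal.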
7. Surface-water salinity measurements were taken in a bottom-sampling project in Whitewater Bay, Florida. These data are stored in the data frame SALINITY in the PASWR2 package. Geographic considerations lead geologists to believe that the salinity variation should be normally distributed. If this is true, it means there is free mixing and interchange between open marine water and fresh water entering the bay (Davis, 1986).
(a) Construct a quantile-quantile plot of the data. Does this plot rule out normality?
(b) Construct a 95% confidence interval for the mean salinity variation.
Solution:
(a) The quantile-quantile plot does not rule out normality.
> ggplot(data = SALINITY, aes(sample = salinity)) +
+   stat_qq() +
+   theme_bw()
[Figure: normal quantile-quantile plot of salinity; horizontal axis theoretical, vertical axis sample.]
(b) > CI CI [1] 46.85025 52.23308 attr(,"conf.level") [1] 0.95 A 95% confidence interval for the mean salinity variation is [46.8502, 52.2331]. 8. The survival times in weeks for 20 male rats that were exposed to a high level of radiation are
152 152 115 109 137  88  94  62 160 165
125  40 128 123 136 101  77 153  83  69
Data are from Lawless (1982) and are stored in the data frame RAT. (a) Construct a quantile-quantile plot of the survival times. Based on the quantile-quantile plot, can normality be ruled out? (b) Construct a 92% confidence interval for the average survival time for male rats exposed to high levels of radiation. Solution: (a) The quantile-quantile plot does not rule out normality. > ggplot(data = RAT, aes(sample = survival.time)) + + stat_qq() + + theme_bw()
[Figure: normal quantile-quantile plot of survival.time; horizontal axis theoretical, vertical axis sample.]
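For part (b), the interval can be reproduced from the 20 survival times listed in the exercise. The sketch below types the values in directly rather than relying on the RAT data frame, and it assumes a one-sample t interval; the vector name `survival` is an illustrative choice.

```r
# Survival times (weeks) for the 20 rats, as listed in the exercise
survival <- c(152, 152, 115, 109, 137, 88, 94, 62, 160, 165,
              125, 40, 128, 123, 136, 101, 77, 153, 83, 69)

# 92% t confidence interval for the mean survival time
ci <- t.test(survival, conf.level = 0.92)$conf.int
ci

# The same interval computed by hand: xbar -/+ t_{0.96; 19} * s / sqrt(n)
n <- length(survival)
mean(survival) + c(-1, 1) * qt(0.96, df = n - 1) * sd(survival) / sqrt(n)
```

A 92% level leaves 4% in each tail, which is why the hand calculation uses the 0.96 quantile of the t distribution with 19 degrees of freedom.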
(b) > CI CI [1] 98.6486 128.2514 attr(,"conf.level") [1] 0.92 A 92% confidence interval for the mean rat survival time for male rats exposed to high levels of radiation is [98.6486, 128.2514]. 9. A large company wants to estimate the proportion of its accounts that are paid on time.
(a) How large a sample is needed to estimate the true proportion within 3% with a 96% confidence level? (b) Suppose 650 out of 800 accounts are paid on time. Construct 95% confidence intervals for the true proportion of accounts that are paid on time using an asymptotic confidence interval, a score confidence interval, an Agresti-Coull confidence interval, and the Clopper-Pearson confidence interval. Solution: (a) > > > > >
p binom.confint(x = 650, n = 800, conf.level = 0.95, methods = "asymptotic") method x n mean lower upper 1 asymptotic 650 800 0.8125 0.7854532 0.8395468 > > > > > >
x > >
ntilde CI CI [1] -1.054064 1.659598 attr(,"conf.level") [1] 0.95
A 95% confidence interval for the true average difference between females and males is [−1.0541, 1.6596]. Note that this interval contains 0, indicating that there is not enough evidence to suggest there are actual gender differences with respect to temperature.
(b) > CI CI [1] -2.496501 -0.256440 attr(,"conf.level") [1] 0.95
A 95% confidence interval for the true average difference between students taking their temperatures at 8 a.m. and students taking their temperatures at 9 a.m. is [−2.4965, −0.2564]. Note that this interval does not contain 0, indicating that there is evidence to suggest students in the 8 a.m. class have lower temperatures than students in the 9 a.m. class. One possible explanation is that students roll straight out of bed and into the 8 a.m. class. Consequently, their temperatures are closer to their sleeping temperatures, which are lower than their waking temperatures.
11. The Cosmed K4b² is a portable metabolic system. A study at Appalachian State University compared the metabolic values obtained from the Cosmed K4b² to those of a reference unit (Amatek) over a range of workloads from easy to maximal to test the validity and reliability of the Cosmed K4b². A small portion of the results for VO₂ (ml/kg/min) measurements taken at a 150 watt workload are stored in data frame COSAMA and in the following table:
Subject  Cosmed  Amatek    Subject  Cosmed  Amatek
   1     31.71   31.20        8     30.33   27.95
   2     33.96   29.15        9     30.78   29.08
   3     30.03   27.88       10     30.78   28.74
   4     24.42   22.79       11     31.84   28.75
   5     29.07   27.00       12     22.80   20.20
   6     28.42   28.09       13     28.99   29.25
   7     31.90   32.66       14     30.80   29.13
(a) Construct a quantile-quantile plot for the between-system differences. (b) Are the VO2 values reported for Cosmed and Amatek independent?
(c) Construct a 95% confidence interval for the average VO2 system difference. Solution: (a) Based on the quantile-quantile plot of the differences between the Cosmed and Amatek VO2 values, there is no reason to rule out normality. > COSAMA ggplot(data = COSAMA, aes(sample = DIFF)) + + stat_qq(color = "red", size = 3) + + theme_bw() 5
[Figure: normal quantile-quantile plot of DIFF; horizontal axis theoretical, vertical axis sample.]
(b) The reported Cosmed and Amatek values are dependent as each subject in the study has both a Cosmed and an Amatek VO2 score. (c)
> CI CI [1] 0.8746899 2.5353101 attr(,"conf.level") [1] 0.95
The 95% confidence interval for the average VO₂ difference is [0.8747, 2.5353].
12. Let {X1, . . . , X19} and {Y1, . . . , Y15} be two random samples from a N(µX, σ) and a N(µY, σ), respectively. Suppose that $\bar{x} = 57.3$, $s_X^2 = 8.3$, $\bar{y} = 65.6$, and $s_Y^2 = 9.7$. Find a 96% confidence interval for µX, µY, and µX − µY.
Solution: > > > > > > > + >
mean.x
xbar1 > > > > >
N > >
N > > > >
N CISIM CIsimRV CI CI [1] 0.07262507 0.11694046 attr(,"conf.level") [1] 0.95 The 95% confidence interval for the proportion of schizophreniform patients admitted to Virgen del Camino is [0.0726, 0.1169]. > CI CI [1] 0.04094285 0.07631626 attr(,"conf.level") [1] 0.95 The 95% confidence interval for the proportion of schizoaffective patients admitted to Virgen del Camino is [0.0409, 0.0763]. > CI CI [1] 0.07667093 0.12193291 attr(,"conf.level") [1] 0.95
The 95% confidence interval for the proportion of bipolar patients admitted to Virgen del Camino is [0.0767, 0.1219]. > CI CI [1] 0.02455615 0.05353698 attr(,"conf.level") [1] 0.95 The 95% confidence interval for the proportion of delusional patients admitted to Virgen del Camino is [0.0246, 0.0535]. > CI CI [1] 0.06324816 0.10522800 attr(,"conf.level") [1] 0.95 The 95% confidence interval for the proportion of psychotic patients admitted to Virgen del Camino is [0.0632, 0.1052]. > CI CI [1] 0.03455100 0.06764427 attr(,"conf.level") [1] 0.95 The 95% confidence interval for the proportion of atypical psychosis patients admitted to Virgen del Camino is [0.0346, 0.0676]. 36. Find the required sample size (n) to estimate the proportion of students spending more than e 10 a week on entertainment with a 95% confidence interval so that the margin of error is no more than 0.02. Solution:
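\[ n = p(1-p)\left(\frac{z_{1-\alpha/2}}{B}\right)^2 = 0.5(1-0.5)\left(\frac{z_{0.975}}{0.02}\right)^2 = 0.25\left(\frac{1.96}{0.02}\right)^2 = 2400.9118 \]
(The final value uses the unrounded quantile $z_{0.975} = 1.959964$.)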
> n c(n, ceiling(n)) [1] 2400.912 2401.000 > # or > nsize(b = 0.02, p = 0.5, conf.level = 0.95, type = "pi") The required sample size (n) to estimate the population proportion of successes with a 0.95 confidence interval so that the margin of error is no more than 0.02 is 2401 .
One must sample at least 2401 students to be 95% confident the margin of error in estimating the true proportion of students spending more than €10 a week on entertainment is no more than 0.02.
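The sample size formula used in Exercise 36 can be sketched as a small base R function. The name `n_prop()` and its arguments are hypothetical and are shown only to make the arithmetic explicit; it is not the `nsize()` function referred to in the solution.

```r
# Sample size so that the margin of error for a proportion is at most B.
# n_prop() is an illustrative helper; p = 0.5 is the conservative choice
# because it maximizes p * (1 - p).
n_prop <- function(B, p = 0.5, conf.level = 0.95) {
  z <- qnorm(1 - (1 - conf.level) / 2)
  p * (1 - p) * (z / B)^2
}

n <- n_prop(B = 0.02)
c(n, ceiling(n))
```

The result is rounded up rather than to the nearest integer, since rounding down would give a margin of error slightly larger than the target.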
Chapter 9 Hypothesis Testing
1. Define α and β for a test of hypothesis. What is the quantity 1 − β called?
Solution: The probability of making a type I error (rejecting the null hypothesis when it is true) is α, while β is the probability of making a type II error (failing to reject the null hypothesis when it is false). The probability of rejecting the null hypothesis when it is false is 1 − β, which is known as the power of the test.
2. How can β be made small in a given hypothesis test with fixed α?
Solution: Increase the sample size.
3. Using a 5% significance level, what is the power of the test H0 : µ = 100 versus H1 : µ ≠ 100 if a sample of size 36 is taken from a N(120, 50)?
Solution:
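The code for this solution did not survive in this copy. A minimal sketch of the power calculation follows, under these assumptions: the test is a two-sided z-test with known σ, and N(120, 50) denotes a mean of 120 and a standard deviation of 50, following the N(µ, σ) notation used elsewhere in the manual. The function name `power_z()` is hypothetical.

```r
# Power of a two-sided z-test of H0: mu = mu0 when the true mean is mu1.
# power_z() is an illustrative helper; sigma is assumed known.
power_z <- function(mu0, mu1, sigma, n, alpha = 0.05) {
  se    <- sigma / sqrt(n)          # standard error of the sample mean
  z     <- qnorm(1 - alpha / 2)     # two-sided critical value
  shift <- (mu1 - mu0) / se         # mean of the standardized statistic under mu1
  # P(reject) = P(Z < -z) + P(Z > z) when Z ~ N(shift, 1)
  pnorm(-z - shift) + pnorm(z - shift, lower.tail = FALSE)
}

pw <- power_z(mu0 = 100, mu1 = 120, sigma = 50, n = 36)
pw
```

Under these assumptions the power is roughly 0.67, so β is roughly 0.33.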
alpha Greater90 = 90) > with(data = Greater90, + eda(totalprice) + ) > n 60, 0002 .
Step 2: Test Statistic — The test statistic chosen is $S^2$ because $E[S^2] = \sigma^2$.
> TS TS [1] 3822980710
The value of this test statistic is $s^2 = 3822980710.3638$. The standardized test statistic under the assumption that H0 is true and its distribution are
\[ \frac{(n-1)S^2}{\sigma_0^2} \sim \chi^2_{n-1}. \]
Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $\chi^2_{93}$ and H1 is an upper one-sided hypothesis, the rejection region is $\chi^2_{obs} > \chi^2_{0.95;\,94-1} = 116.511$. The value of the standardized test statistic is $\chi^2_{obs} = 98.7603$.
> RR RR [1] 116.511
> STS STS [1] 98.76034
Step 4: Statistical Conclusion — The ℘-value is P(χ293 ≥ 98.7603) = 0.3218. > pvalue pvalue [1] 0.3218218
I. From the rejection region, fail to reject H0 because χ2obs = 98.7603 is less than 116.511. II. From the ℘-value, fail to reject H0 because the ℘-value 0.3218 is greater than 0.05.
Fail to reject H0. Step 5: English Conclusion — There is insufficient evidence to suggest the variance for the appraised price of 90 m² or larger pisos is greater than 60,000² €².
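The quantities reported in Steps 2 through 4 can be recomputed from the summary values alone. The sketch below assumes n = 94, s² = 3822980710.3638, and σ0 = 60,000, all taken from the text above; the variable names are illustrative.

```r
# Chi-square test for a variance, recomputed from the reported summary values
n      <- 94                 # sample size (the text uses 94 - 1 degrees of freedom)
s2     <- 3822980710.3638    # sample variance reported in Step 2
sigma0 <- 60000              # hypothesized standard deviation

chi_obs <- (n - 1) * s2 / sigma0^2                        # standardized test statistic
crit    <- qchisq(0.95, df = n - 1)                       # upper one-sided critical value
pval    <- pchisq(chi_obs, df = n - 1, lower.tail = FALSE)

c(chi_obs = chi_obs, critical = crit, p_value = pval)
```

Because `chi_obs` is below the critical value and the p-value exceeds 0.05, both decision rules lead to the same conclusion as in the text.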
6. The Hubble Space Telescope was put into orbit on April 25, 1990. Unfortunately, on June 25, 1990, a spherical aberration was discovered in Hubble’s primary mirror. To correct this, astronauts had to work in space. To prepare for the mission, two teams of astronauts practiced making repairs under simulated space conditions. Each team of astronauts went through 15 identical scenarios. The times to complete each scenario were recorded in days. Is one team better than the other? If not, can both teams complete the mission in less than 3 days? Use a 5% significance level for all tests. The data are stored in the data frame HUBBLE. Solution: Note that each team of astronauts went through 15 identical scenarios. Consequently, the repair times for the two teams are dependent. Start the analysis by verifying the normality assumption required to use a paired t-test. > Diff eda(Diff)
[Figure: exploratory data analysis of Diff, with panels: histogram, density, boxplot, and Q-Q plot.]
The results from applying the function eda() to the differences between team1 and team2 suggest it is not unreasonable to assume the repair time differences between team1 and team2 follow a normal distribution. Now, proceed with the five-step procedure. Step 1: Hypotheses — To test whether the mean difference in repair times between team1 and team2 is zero, the hypotheses are H0 : µD = 0 versus H1 : µD ≠ 0
Step 2: Test Statistic — The test statistic chosen is D because E D = µD . > dbar dbar [1] -0.1 The value of this test statistic is d¯ = −0.1. The standardized test statistic under √0 the assumption that H0 is true and its distribution are SDD−δ / nD ∼ t15−1 . Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t14 and H1 is a two-sided hypothesis, the rejection region is |tobs | > t0.975; 14 = 2.1448. > RR RR [1] 2.144787 > TR TR
One Sample t-test data: Diff t = -0.25836, df = 14, p-value = 0.7999 alternative hypothesis: true mean is not equal to 0 95 percent confidence interval: -0.9301447 0.7301447 sample estimates: mean of x -0.1
The value of the standardized test statistic is $t_{obs} = \dfrac{\bar{d}-\delta_0}{s_D/\sqrt{n_D}} = -0.2584$.
Step 4: Statistical Conclusion — The ℘-value is $2 \times P(t_{14} \ge 0.2584) = 0.7999$.
I. From the rejection region, fail to reject H0 because |tobs | = 0.2584 is less than 2.1448. II. From the ℘-value, fail to reject H0 because the ℘-value 0.7999 is greater than 0.05.
Fail to reject H0 .
Step 5: English Conclusion — There is not sufficient evidence to suggest the mean difference in repair times is not equal to zero. In other words, there is no evidence to suggest one team is better than the other.
To answer whether both teams can complete the mission in less than three days, start by verifying the normality assumption of the data for team1 using exploratory data analysis (eda()).
> with(data = HUBBLE, + eda(team1) + )
[Figure: exploratory data analysis of team1, with panels: histogram, density, boxplot, and Q-Q plot.]
The results from applying the function eda() to the repair times for team1 suggest it is not unreasonable to assume the repair times for team1 follow a normal distribution. Now, proceed with the five-step procedure. Step 1: Hypotheses — To test team1 repair time is less than 3 days, the hypotheses are H0 : µ = 3 versus H1 : µ < 3 Step 2: Test Statistic — The test statistic chosen is X because E X = µ. > xbar xbar [1] 2.22 n
x
i = 2.22. The standardized test statistic The value of this test statistic is x ¯ = i=1 n √ 0 ∼ t15−1 . under the assumption that H0 is true and its distribution are X−µ S/ n
Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t14 and H1 is a lower one-sided hypothesis, the rejection region is tobs < t0.05; 14 = −1.7613. > RR RR [1] -1.76131 > TR TR
One Sample t-test data: team1 t = -3.7753, df = 14, p-value = 0.001024 alternative hypothesis: true mean is less than 3 95 percent confidence interval: -Inf 2.583896 sample estimates: mean of x 2.22
The value of the standardized test statistic is $t_{obs} = \dfrac{\bar{x}-\mu_0}{s/\sqrt{n}} = -3.7753$.
Step 4: Statistical Conclusion — The ℘-value is P(t14 ≤ −3.7753) = 0.001.
I. From the rejection region, reject H0 because tobs = −3.7753 is less than −1.7613. II. From the ℘-value, reject H0 because the ℘-value 0.001 is less than 0.05.
Reject H0 .
Step 5: English Conclusion — There is evidence that the team1 average mission repair time is less than 3 days.
For team2, start by verifying the normality assumption of the data using exploratory data analysis (eda()).
> with(data = HUBBLE, + eda(team2) + )
[Figure: exploratory data analysis of team2, with panels: histogram, density, boxplot, and Q-Q plot.]
The results from applying the function eda() to the repair times for team2 suggest it is not unreasonable to assume the repair times for team2 follow a normal distribution. Now, proceed with the five-step procedure. Step 1: Hypotheses — To test team2 repair time is less than 3 days, the hypotheses are H0 : µ = 3 versus H1 : µ < 3 Step 2: Test Statistic — The test statistic chosen is X because E X = µ. > xbar xbar [1] 2.32 n
x
i = 2.32. The standardized test statistic The value of this test statistic is x ¯ = i=1 n √ 0 ∼ t15−1 . under the assumption that H0 is true and its distribution are X−µ S/ n
Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t14 and H1 is a lower one-sided hypothesis, the rejection region is tobs < t0.05; 14 = −1.7613. > RR RR [1] -1.76131 > TR TR
One Sample t-test data: team2 t = -2.6835, df = 14, p-value = 0.008911 alternative hypothesis: true mean is less than 3 95 percent confidence interval: -Inf 2.766309 sample estimates: mean of x 2.32 The value of the standardized test statistic is tobs =
$\dfrac{\bar{x}-\mu_0}{s/\sqrt{n}} = -2.6835$.
Step 4: Statistical Conclusion — The ℘-value is $P(t_{14} \le -2.6835) = 0.0089$.
I. From the rejection region, reject H0 because tobs = −2.6835 is less than −1.7613.
II. From the ℘-value, reject H0 because the ℘-value 0.0089 is less than 0.05.
Reject H0.
Step 5: English Conclusion — There is evidence that the team2 average mission repair time is less than 3 days. The evidence suggests both teams can complete the mission in less than 3 days.
7. The research and development department of an appliance company suspects the energy consumption required of their 18-cubic-foot refrigerator can be reduced by a slight modification to the current motor. Sixty 18-cubic-foot refrigerators were randomly selected from the company’s warehouse. The first 30 had their motors modified while the last 30 were left intact. The energy consumption (kilowatts) for a 24-hour period for each refrigerator was recorded and stored in the data frame REFRIGERATOR. Is there evidence that the design modification reduces the refrigerators’ average energy consumption?
Solution: To solve this problem, start by verifying the reasonableness of the normality assumption.
> ggplot(data = REFRIGERATOR, aes(x = group, y = kilowatts, fill = group)) +
+   geom_boxplot() +
+   theme_bw()
> ggplot(data = REFRIGERATOR, aes(sample = kilowatts, color = group)) +
+   stat_qq() +
+   theme_bw()
[Figure: side-by-side boxplots of kilowatts by group (modified, original) and normal quantile-quantile plots (theoretical versus sample) colored by group.]
The side-by-side boxplots and normal quantile-quantile plots suggest it is reasonable to assume the energy consumption for both models follows a normal distribution. Now, proceed with the five-step procedure.
Step 1: Hypotheses — Since the problem asks whether the mean energy consumption for modified refrigerators is less than the mean energy consumption for original refrigerators, use a lower one-sided alternative hypothesis. H0 : µmodified − µoriginal = 0 versus H1 : µmodified − µoriginal < 0
Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.
> Means Means modified original 1.535800 1.760067
The value of this test statistic is 1.5358 − 1.7601 = −0.2243. The standardized test statistic under the assumption that H0 is true and its approximate distribution are
\[ \frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \sim t_\nu. \]
Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately $t_\nu$ and H1 is a lower one-sided hypothesis, the rejection region is $t_{obs} < t_{0.05;\,54.7888} = -1.6731$.
> TR TR
Welch Two Sample t-test
data: kilowatts by group
t = -2.5128, df = 54.789, p-value = 0.007475
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
 -Inf -0.07494116
sample estimates:
mean in group modified mean in group original
              1.535800               1.760067
> RR RR [1] -1.673144
The degrees of freedom are
\[ \nu = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X-1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y-1}} = 54.7888, \]
and the value of the standardized test statistic is
\[ t_{obs} = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}} = -2.5128. \]
Step 4: Statistical Conclusion — The ℘-value is $P(t_{54.7888} \le -2.5128) = 0.0075$.
I. From the rejection region, reject H0 because tobs = −2.5128 is less than −1.6731.
II. From the ℘-value, reject H0 because the ℘-value = 0.0075 is less than 0.05.
Reject H0.
Step 5: English Conclusion — There is evidence to suggest the average energy consumption for modified refrigerators is less than the average energy consumption for unmodified (original) refrigerators.
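The Welch degrees of freedom formula can be written as a short function and compared with what `t.test()` reports. The sketch below uses simulated data, since the REFRIGERATOR data frame requires the PASWR2 package; the seed, means, and standard deviations are hypothetical and are not the refrigerator measurements. The name `welch_df()` is also illustrative.

```r
# Welch-Satterthwaite degrees of freedom for two independent samples.
# welch_df() is an illustrative helper name.
welch_df <- function(x, y) {
  vx <- var(x) / length(x)
  vy <- var(y) / length(y)
  (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
}

# Simulated stand-ins for the two groups (illustrative values only)
set.seed(42)
modified <- rnorm(30, mean = 1.5, sd = 0.3)
original <- rnorm(30, mean = 1.8, sd = 0.4)

nu <- welch_df(modified, original)
tt <- t.test(modified, original, alternative = "less")  # Welch test is the default
c(by_hand = nu, t.test = unname(tt$parameter))

# Lower one-sided 5% critical value for this nu
qt(0.05, df = nu)
```

The hand-computed value and the `df` reported by `t.test()` should agree, because `t.test()` uses the unequal-variance (Welch) form unless `var.equal = TRUE` is set.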
8. The Yonalasee tennis club has two systems to measure the speed of a tennis ball. The local tennis pro suspects one system (speed1) consistently records faster speeds. To test her suspicions, she sets up both systems and records the speeds of 12 serves (three serves from each side of the court). The values are stored in the data frame TENNIS in the variables speed1 and speed2. The recorded speeds are in kilometers per hour. Does the evidence support the tennis pro’s suspicion? Use α = 0.10. Solution: Note that each system records the same 12 serves. Consequently, the serve speeds recorded by each system are dependent. Start the analysis by verifying the normality assumption required to use a paired t-test.
> Diff eda(Diff)
[Figure: exploratory data analysis of Diff, with panels: histogram, density, boxplot, and Q-Q plot.]
The results from applying the function eda() to the differences between speed1 and speed2 suggest it is not unreasonable to assume the serve speed differences between speed1 and speed2 follow a normal distribution. Now, proceed with the five-step procedure. Step 1: Hypotheses — To test the average difference (speed1 - speed2) in recorded speeds, the hypotheses are H0 : µD = 0 versus H1 : µD > 0 Step 2: Test Statistic — The test statistic chosen is D because E D = µD . > dbar dbar [1] -1.329167 > n n [1] 12 The value of this test statistic is d¯ = −1.3292. The standardized test statistic under √0 the assumption that H0 is true and its distribution are SDD−δ / nD ∼ t12−1 . Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed t11 and H1 is a one-sided hypothesis, the rejection region is tobs > t0.90; 11 = 1.3634.
> RR RR [1] 1.36343 > TR TR
One Sample t-test data: Diff t = -0.2804, df = 11, p-value = 0.6078 alternative hypothesis: true mean is greater than 0 95 percent confidence interval: -9.842155 Inf sample estimates: mean of x -1.329167 > # or > with(data = TENNIS, + t.test(speed1, speed2, paired = TRUE, alternative = "greater") + )
Paired t-test data: speed1 and speed2 t = -0.2804, df = 11, p-value = 0.6078 alternative hypothesis: true difference in means is greater than 0 95 percent confidence interval: -9.842155 Inf sample estimates: mean of the differences -1.329167 The value of the standardized test statistic is tobs =
$\dfrac{\bar{d}-\delta_0}{s_D/\sqrt{n_D}} = -0.2804$.
Step 4: Statistical Conclusion — The ℘-value is P(t11 ≥ −0.2804) = 0.6078. I. From the rejection region, fail to reject H0 because tobs = −0.2804 is less than 1.3634. II. From the ℘-value, fail to reject H0 because the ℘-value 0.6078 is greater than 0.10. Fail to reject H0 . Step 5: English Conclusion — There is not sufficient evidence to suggest the mean difference between speeds is greater than zero.
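The two `t.test()` calls shown in this solution give identical results because a paired t-test is a one-sample t-test on the differences. The sketch below illustrates that equivalence with simulated speeds; the seed and the simulated values are hypothetical stand-ins and are not the TENNIS data.

```r
# A paired t-test is a one-sample t-test on the differences.
# The speeds below are simulated for illustration only.
set.seed(7)
speed1 <- rnorm(12, mean = 180, sd = 15)
speed2 <- speed1 + rnorm(12, mean = 0, sd = 10)
d <- speed1 - speed2

# Standardized statistic computed by hand (delta0 = 0)
t_manual <- mean(d) / (sd(d) / sqrt(length(d)))

one_sample <- t.test(d, alternative = "greater")
paired     <- t.test(speed1, speed2, paired = TRUE, alternative = "greater")

c(manual = t_manual,
  one_sample = unname(one_sample$statistic),
  paired = unname(paired$statistic))

# Upper one-sided critical value used in Step 3 (alpha = 0.10, 11 df)
qt(0.90, df = 11)
```

All three statistics coincide, and the critical value matches the 1.3634 quoted in Step 3.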
9. An advertising agency is interested in targeting the appropriate gender for a new “low-fat” yogurt. In a national survey of 1200 women, 825 picked the “low-fat” yogurt over a regular yogurt. Meanwhile, 525 out of 1150 men picked the “low-fat” yogurt over the regular yogurt. Given these results, should the advertisements be targeted at a specific gender? Test the appropriate hypothesis at the α = 0.01 level.
Solution: To solve this problem, use the five-step procedure.
Step 1: Hypotheses — The null and alternative hypotheses to test whether the proportion of women who favor low-fat yogurt is the same as the proportion of men who favor low-fat yogurt are H0 : πX = πY versus H1 : πX ≠ πY. In this case, let the random variable X represent the number of females favoring low-fat yogurt, and let the random variable Y represent the number of males favoring low-fat yogurt.
Step 2: Test Statistic — The test statistic chosen is $P_X - P_Y$ since $E[P_X - P_Y] = \pi_X - \pi_Y$. The standardized test statistic under the assumption that H0 is true is
\[ Z = \frac{P_X - P_Y}{\sqrt{P(1-P)\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}} \]
Step 3: Rejection Region Calculations — Because the standardized test statistic has an approximate N (0, 1) distribution and H1 is a two-sided hypothesis, the rejection region is |zobs | > z0.995 = 2.5758. > RR RR [1] 2.575829 > > > > > >
x m y n p p
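Most of the code for this step did not survive in this copy. The following is a hedged sketch of the pooled two-proportion z-test using the counts stated in the exercise; the variable names follow the residue above (`x`, `m`, `y`, `n`, `p`), but the exact original code is unknown.

```r
# Pooled two-proportion z-test from the counts given in the exercise
x <- 825; m <- 1200    # women choosing low-fat yogurt, women surveyed
y <- 525; n <- 1150    # men choosing low-fat yogurt, men surveyed

px <- x / m
py <- y / n
p  <- (x + y) / (m + n)                       # pooled proportion under H0

z_obs <- (px - py) / sqrt(p * (1 - p) * (1 / m + 1 / n))
crit  <- qnorm(0.995)                         # two-sided critical value, alpha = 0.01
pval  <- 2 * pnorm(-abs(z_obs))

c(z_obs = z_obs, critical = crit, p_value = pval)
abs(z_obs) > crit                             # TRUE means reject H0

# Cross-check: prop.test() without continuity correction reports z_obs^2
chisq <- unname(prop.test(c(x, y), c(m, n), correct = FALSE)$statistic)
```

With these counts the observed statistic is far beyond the critical value, so the data indicate a difference between the two proportions at the α = 0.01 level.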
ggplot(data = MILKCARTON, aes(x = size, y = seconds, fill = size)) + + geom_boxplot() + + theme_bw() > ggplot(data = MILKCARTON, aes(sample = seconds, color = size)) + + stat_qq() + + theme_bw()
[Figure: side-by-side boxplots of seconds by size (halfgallon, wholegallon) and normal quantile-quantile plots (theoretical versus sample) colored by size.]
The side-by-side boxplots and normal quantile-quantile plots suggest it is reasonable to assume the drying times for both half gallon and whole gallon containers follow normal distributions; however, it is clear from the boxplot that the variances are very different. Now, proceed with the five-step procedure.
Step 1: Hypotheses — Since the problem asks whether the mean drying times for half and whole gallon containers differ, use a two-sided alternative hypothesis. H0 : µX − µY = 0 versus H1 : µX − µY ≠ 0
Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.
> Means Means halfgallon wholegallon 9.98500 12.19525
The value of this test statistic is 9.985 − 12.1952 = −2.2103. The standardized test statistic under the assumption that H0 is true and its approximate distribution are
\[ \frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \sim t_\nu. \]
Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately $t_\nu$ and H1 is a two-sided hypothesis, the rejection region is $|t_{obs}| > t_{0.95;\,46.0992} = 1.6786$.
> TR TR
Welch Two Sample t-test
data: seconds by size
t = -6.5172, df = 46.099, p-value = 4.796e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.892864 -1.527636
sample estimates:
mean in group halfgallon mean in group wholegallon
                 9.98500                  12.19525
> RR RR [1] 1.678586
The degrees of freedom are
\[ \nu = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X-1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y-1}} = 46.0992, \]
and the value of the standardized test statistic is
\[ t_{obs} = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}} = -6.5172. \]
Step 4: Statistical Conclusion — The ℘-value is $2 \times P(t_{46.0992} \ge 6.5172) \approx 4.8 \times 10^{-8}$, which is 0 to four decimal places. I. From the rejection region, reject H0 because |tobs| = 6.5172 is greater than 1.6786. II. From the ℘-value, reject H0 because the ℘-value ≈ 0 is less than 0.05. Reject H0. Step 5: English Conclusion — There is evidence to suggest the average drying times for half and whole gallon containers are not the same.
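The two-sided ℘-value in Step 4 can be recovered from the reported statistic and degrees of freedom alone. This is a small sketch using the rounded values printed in the `t.test()` output, so the result agrees with the printed p-value only up to rounding; the variable names are illustrative.

```r
# Two-sided p-value from the reported Welch statistic and degrees of freedom
t_obs <- -6.5172
nu    <- 46.099

pval <- 2 * pt(-abs(t_obs), df = nu)
pval
format(round(pval, 4), nsmall = 4)   # rounds to zero at four decimal places
```

Doubling the lower-tail probability at −|t| is equivalent to the 2 × P(t ≥ |t|) form used in the text, by symmetry of the t distribution.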
11. A multinational conglomerate has two textile centers in two different cities. In order to make a profit, each location must produce more than 1000 kilograms of refined wool per day. A random sample of the wool production in kilograms on five different days over the last year for the two locations was taken. The results are stored in the data frame WOOL. Based on the collected data, does the evidence suggest the locations are profitable? Is one location superior to the other? Solution: To see if textileA is profitable, start by verifying the normality assumption of the data using exploratory data analysis (eda()). > woolA with(data = woolA, + eda(production) + )
[Figure: EXPLORATORY DATA ANALYSIS of production for textileA. Panels: Histogram of production, Density of production, Boxplot of production, Q−Q Plot of production.]
The results from applying the function eda() to the production of wool suggest it is not unreasonable to assume production of wool for textileA follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if wool production for textileA exceeds 1000 kilograms per day, the hypotheses are $H_0: \mu = 1$ versus $H_1: \mu > 1$

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 1.226

The value of this test statistic is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 1.226$. The standardized test statistic under the assumption that H0 is true and its distribution are
$$\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{14}$ and H1 is an upper one-sided hypothesis, the rejection region is $t_{obs} > t_{0.95;\,14} = 1.7613$.

> RR
[1] 1.76131
> TR

    One Sample t-test

data:  production
t = 5.1942, df = 14, p-value = 6.801e-05
alternative hypothesis: true mean is greater than 1
95 percent confidence interval:
 1.149365      Inf
sample estimates:
mean of x 
    1.226 

The value of the standardized test statistic is
$$t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 5.1942.$$
Step 4: Statistical Conclusion — The ℘-value is P(t14 ≥ 5.1942) = 1e-04.
I. From the rejection region, reject H0 because tobs = 5.1942 is greater than 1.7613.
II. From the ℘-value, reject H0 because the ℘-value = 1e-04 is less than 0.05.
Reject H0 .
Step 5: English Conclusion — There is evidence to suggest textileA produces more than 1000 kilograms of refined wool per day.
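The five steps above can be traced by hand in a few lines of R. The following is a minimal sketch on simulated data, since the WOOL values themselves are not listed here: the vector production, the seed, and the simulation parameters are hypothetical, and only the formula for the standardized test statistic and the cutoff $t_{0.95;\,14}$ come from the solution. The last two lines compare the hand computation with t.test().

```r
# Hypothetical sample (NOT the WOOL data): 15 simulated daily productions
set.seed(1)
production <- rnorm(15, mean = 1.2, sd = 0.2)
mu0 <- 1

n     <- length(production)
xbar  <- mean(production)
s     <- sd(production)
t_obs <- (xbar - mu0) / (s / sqrt(n))               # standardized test statistic
crit  <- qt(0.95, df = n - 1)                       # upper one-sided cutoff, alpha = 0.05
p_val <- pt(t_obs, df = n - 1, lower.tail = FALSE)  # upper-tail p-value

# t.test() should give the same statistic and p-value
TR <- t.test(production, mu = mu0, alternative = "greater")
all.equal(unname(TR$statistic), t_obs)
all.equal(TR$p.value, p_val)
```

Computing the pieces separately makes it easy to see which quantity changes when the hypothesized mean or the significance level changes.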
To see if textileB is profitable, start by verifying the normality assumption of the data using exploratory data analysis (eda()).

> with(data = woolB,
+      eda(production)
+ )
[Figure: EXPLORATORY DATA ANALYSIS of production for textileB. Panels: Histogram of production, Density of production, Boxplot of production, Q−Q Plot of production.]
The results from applying the function eda() to the production of wool suggest it is not unreasonable to assume production of wool for textileB follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if wool production for textileB exceeds 1000 kilograms per day, the hypotheses are $H_0: \mu = 1$ versus $H_1: \mu > 1$

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 1.446

The value of this test statistic is $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 1.446$. The standardized test statistic under the assumption that H0 is true and its distribution are
$$\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{14}$ and H1 is an upper one-sided hypothesis, the rejection region is $t_{obs} > t_{0.95;\,14} = 1.7613$.

> RR
[1] 1.76131
> TR

    One Sample t-test

data:  production
t = 5.1386, df = 14, p-value = 7.53e-05
alternative hypothesis: true mean is greater than 1
95 percent confidence interval:
 1.293129      Inf
sample estimates:
mean of x 
    1.446 

The value of the standardized test statistic is
$$t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 5.1386.$$
Step 4: Statistical Conclusion — The ℘-value is P(t14 ≥ 5.1386) = 1e-04.
I. From the rejection region, reject H0 because tobs = 5.1386 is greater than 1.7613.
II. From the ℘-value, reject H0 because the ℘-value = 1e-04 is less than 0.05.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest textileB produces more than 1000 kilograms of refined wool per day.

To discover if one textile is superior to the other, start by verifying the reasonableness of the normality assumption.

> ggplot(data = WOOL, aes(x = location, y = production, fill = location)) +
+   geom_boxplot() +
+   theme_bw()
> ggplot(data = WOOL, aes(sample = production, color = location)) +
+   stat_qq() +
+   theme_bw()
[Figure: side-by-side boxplots of production by location (textileA, textileB), and normal quantile-quantile plots of production (sample versus theoretical quantiles) colored by location.]
The side-by-side boxplots and normal quantile-quantile plots suggest it may be reasonable to assume the wool production for each textile plant follows a normal distribution; however, it is clear from the boxplot that the variances are different. Now, proceed with the five-step procedure.

Step 1: Hypotheses — Since the problem asks whether the mean wool production differs between the textile plants and does not suggest one textile plant is superior to the other, use a two-sided alternative hypothesis. $H_0: \mu_X - \mu_Y = 0$ versus $H_1: \mu_X - \mu_Y \neq 0$

Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.

> Means
textileA textileB 
   1.226    1.446 

The value of this test statistic is 1.226 − 1.446 = −0.22. The standardized test statistic under the assumption that H0 is true and its approximate distribution are
$$\frac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \overset{\cdot}{\sim} t_{\nu}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately $t_{\nu}$ and H1 is a two-sided hypothesis, the rejection region is $|t_{obs}| > t_{0.95;\,20.6186} = 1.7222$.

> TR
    Welch Two Sample t-test

data:  production by location
t = -2.266, df = 20.619, p-value = 0.03435
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.42213539 -0.01786461
sample estimates:
mean in group textileA mean in group textileB 
                 1.226                  1.446 

> RR
[1] 1.722211

The degrees of freedom are
$$\nu = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X - 1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 20.6186,$$
and the value of the standardized test statistic is
$$t_{obs} = \frac{(\bar{x} - \bar{y}) - \delta_0}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}} = -2.266.$$

Step 4: Statistical Conclusion — The ℘-value is $2 \times P(t_{20.6186} \geq 2.266) = 0.0344$.
I. From the rejection region, reject H0 because |tobs| = 2.266 is greater than 1.7222.
II. From the ℘-value, reject H0 because the ℘-value = 0.0344 is less than 0.05.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest different mean wool production for the two plants.
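The degrees of freedom formula used in this solution can be checked against t.test(), which applies the Welch approximation by default. The sketch below uses simulated samples, not the WOOL data; the vectors x and y, the seed, and the simulation parameters are hypothetical, while the formulas for ν and tobs are the ones displayed above with δ0 = 0.

```r
# Hypothetical samples (NOT the WOOL data) with unequal variances
set.seed(2)
x <- rnorm(15, mean = 1.2, sd = 0.15)
y <- rnorm(15, mean = 1.4, sd = 0.35)

vx <- var(x) / length(x)   # s_X^2 / n_X
vy <- var(y) / length(y)   # s_Y^2 / n_Y

# Welch-Satterthwaite degrees of freedom
nu <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Standardized test statistic with delta_0 = 0
t_obs <- (mean(x) - mean(y)) / sqrt(vx + vy)

TR <- t.test(x, y)         # var.equal = FALSE is the default (Welch)
all.equal(unname(TR$parameter), nu)
all.equal(unname(TR$statistic), t_obs)
```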
12. Use the data frame FERTILIZE, which contains the height in inches for plants in the variable height and the fertilization type in the variable fertilization, to
(a) Test if the data suggest that the average height of self-fertilized plants is more than 17 inches. (Use α = 0.05.)
(b) Compute a one-sided 95% confidence interval for the average height of self-fertilized plants (H1 : µ > 17).
(c) Compute the required sample size to obtain a power of 0.90 if µ1 = 18 inches assuming that σ = s.
(d) What is the power of the test in part (a) if σ = s and µ1 = 18?

Solution:
(a) To solve this problem, start by verifying the normality assumption of the data using exploratory data analysis (eda()).

Step 1: Hypotheses — To test if the average height of self-fertilized plants is more than 17 inches, the hypotheses are $H_0: \mu = 17$ versus $H_1: \mu > 17$

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 17.575

The test statistic is $\bar{x} = 17.575$. The standardized test statistic under the assumption that H0 is true and its distribution are
$$\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{15-1}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{14}$, and H1 is an upper one-sided hypothesis, the rejection region is $t_{obs} > t_{1-0.05;\,14} = t_{0.95;\,14} = 1.7613$.

> RR
[1] 1.76131
> TR
    One Sample t-test

data:  SELF$height
t = 1.0854, df = 14, p-value = 0.148
alternative hypothesis: true mean is greater than 17
95 percent confidence interval:
 16.64196      Inf
sample estimates:
mean of x 
   17.575 

The value of the standardized test statistic is
$$t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = 1.0854.$$
Step 4: Statistical Conclusion — The ℘-value is P(t14 ≥ 1.0854) = 0.148.
I. From the rejection region, fail to reject H0 because tobs = 1.0854 is less than 1.7613.
II. From the ℘-value, fail to reject H0 because the ℘-value = 0.148 is greater than 0.05.
Fail to reject H0.

Step 5: English Conclusion — There is insufficient evidence to suggest that the average height of self-fertilized plants is more than 17 inches.

(b)

> TR

    One Sample t-test

data:  SELF$height
t = 1.0854, df = 14, p-value = 0.148
alternative hypothesis: true mean is greater than 17
95 percent confidence interval:
 16.64196      Inf
sample estimates:
mean of x 
   17.575 

The one-sided 95% confidence interval for the average height of self-fertilized plants is [16.642, ∞).

(c)

> n
[1] 38
One needs a sample size of at least 38 to obtain a power of at least 0.90.

(d)

> POWER

     One-sample t test power calculation 

              n = 15
          delta = 1
             sd = 2.051676
      sig.level = 0.05
          power = 0.5598609
    alternative = one.sided

> power
[1] 0.5598609

The power of the test is 0.5599.
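Parts (c) and (d) can both be obtained from power.t.test() in the stats package. The calls below are a sketch of one way to arrive at the reported values, not necessarily the calls used in the original solution; the standard deviation 2.051676 is copied from the output above, delta = 18 − 17 = 1, and the object names need and have are illustrative.

```r
s <- 2.051676   # sample standard deviation reported in the output above

# Part (c): sample size for power 0.90 when the true mean is 18 (delta = 18 - 17)
need <- power.t.test(delta = 1, sd = s, sig.level = 0.05, power = 0.90,
                     type = "one.sample", alternative = "one.sided")
n <- ceiling(need$n)    # round up to a whole number of plants

# Part (d): power of the test with the 15 observations at hand
have <- power.t.test(n = 15, delta = 1, sd = s, sig.level = 0.05,
                     type = "one.sample", alternative = "one.sided")
power <- have$power

c(n = n, power = power)
```

Leaving exactly one of n and power unspecified tells power.t.test() which quantity to solve for.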
13. A manufacturer of lithium batteries has two production facilities. One facility (A) manufactures a battery with an advertised life of 180 hours, while the second facility (B) manufactures a battery with an advertised life of 200 hours. Both facilities are trying to reduce the variance in their products’ lifetimes. Is the variability in battery life equivalent, or does the evidence suggest the facility producing 200-hour batteries has smaller variability than the facility producing 180-hour batteries? Use the data frame BATTERY with α = 0.05 to test the appropriate hypothesis.

Solution: Prior to using a test that is very sensitive to departures from normality, density plots and quantile-quantile normal plots are created for both facilities.

> ggplot(data = BATTERY, aes(lifetime, fill = facility)) +
+   geom_density() +
+   theme_bw()
> ggplot(data = BATTERY, aes(sample = lifetime, color = facility)) +
+   stat_qq() +
+   theme_bw()
[Figure: density plots of lifetime by facility (A, B), and normal quantile-quantile plots of lifetime (sample versus theoretical quantiles) colored by facility.]
Based on the density plots and quantile-quantile normal plots, it seems reasonable to assume the battery life from each facility follows a normal distribution. Therefore, proceed with the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the variability in facility A’s battery life (X) is greater than the variability in facility B’s battery life (Y) are $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 > \sigma_Y^2$.

Step 2: Test Statistic — The test statistics chosen are $S_X^2$ and $S_Y^2$ since $E[S_X^2] = \sigma_X^2$ and $E[S_Y^2] = \sigma_Y^2$.

> VAR
       A        B 
7.539291 4.347130 

The values of these test statistics are $s_X^2 = 7.5393$ and $s_Y^2 = 4.3471$. The standardized test statistic under the assumption that H0 is true and its distribution are $S_X^2 / S_Y^2 \sim F_{50-1,\,50-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $F_{49,49}$, and H1 is an upper one-sided hypothesis, the rejection region is $f_{obs} > F_{0.95;\,49,49} = 1.6073$.

> RR
[1] 1.607289
> TR
    F test to compare two variances

data:  lifetime by facility
F = 1.7343, num df = 49, denom df = 49, p-value = 0.02836
alternative hypothesis: true ratio of variances is greater than 1
95 percent confidence interval:
 1.079031      Inf
sample estimates:
ratio of variances 
          1.734314 

The value of the standardized test statistic is fobs = (7.5393)/(4.3471) = 1.7343.

Step 4: Statistical Conclusion — The ℘-value is $P(F_{49,49} \geq 1.7343) = 0.0284$.
I. From the rejection region, reject H0 because fobs = 1.7343 is greater than 1.6073.
II. From the ℘-value, reject H0 because the ℘-value = 0.0284 is less than 0.05.
Reject H0.
Step 5: English Conclusion — The evidence suggests the variability of battery life from facility A is greater than the variance for battery life from facility B.
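Because the F test depends on the data only through the two sample variances and the sample sizes, the whole analysis can be reproduced from the summary values printed above. The following sketch assumes nothing beyond those values; the object names are illustrative. It also shows where the lower confidence bound in the var.test() output comes from: the observed ratio divided by the upper cutoff.

```r
# Sample variances and sizes reported above
s2_A <- 7.539291; n_A <- 50
s2_B <- 4.347130; n_B <- 50

f_obs <- s2_A / s2_B                                   # observed variance ratio
crit  <- qf(0.95, df1 = n_A - 1, df2 = n_B - 1)        # upper one-sided cutoff
p_val <- pf(f_obs, df1 = n_A - 1, df2 = n_B - 1, lower.tail = FALSE)

# One-sided 95% lower confidence bound for sigma_X^2 / sigma_Y^2
lower <- f_obs / crit

round(c(f_obs = f_obs, crit = crit, p_val = p_val, lower = lower), 4)
```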
14. In the construction of a safety strobe, a particular manufacturer can purchase LED diodes from one of two suppliers. It is critical that the purchased diodes conform to their stated specifications with respect to diameter since they must be mated with a fixed width cable. The diameter in millimeters for a random sample of 15 diodes from each of the two suppliers is stored in the data frame LEDDIODE. Based on the data, is there evidence to suggest a difference in variabilities between the two suppliers? Use an α level of 0.01.

Solution: Prior to using a test that is very sensitive to departures from normality, density plots and quantile-quantile normal plots are created for both suppliers.

> ggplot(data = LEDDIODE, aes(diameter, fill = supplier)) +
+   geom_density(alpha = 0.3) +
+   theme_bw()
> ggplot(data = LEDDIODE, aes(sample = diameter, color = supplier)) +
+   stat_qq() +
+   theme_bw()
[Figure: density plots of diameter by supplier (supplierA, supplierB), and normal quantile-quantile plots of diameter (sample versus theoretical quantiles) colored by supplier.]
Based on the density plots and quantile-quantile normal plots, it seems reasonable to assume the LED diode widths from each supplier follow a normal distribution. Therefore, proceed with the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the variability in LED diode widths using supplier A’s (X) diodes is not equal to the variability in LED diode widths using supplier B’s (Y) diodes are $H_0: \sigma_X^2 = \sigma_Y^2$ versus $H_1: \sigma_X^2 \neq \sigma_Y^2$.

Step 2: Test Statistic — The test statistics chosen are $S_X^2$ and $S_Y^2$ since $E[S_X^2] = \sigma_X^2$ and $E[S_Y^2] = \sigma_Y^2$.

> VAR
 supplierA  supplierB 
0.06495524 0.01506381 

The values of these test statistics are $s_X^2 = 0.065$ and $s_Y^2 = 0.0151$. The standardized test statistic under the assumption that H0 is true and its distribution are $S_X^2 / S_Y^2 \sim F_{15-1,\,15-1}$.

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $F_{14,14}$, and H1 is a two-sided hypothesis, the rejection region is $f_{obs} < F_{0.005;\,14,14} = 0.2326$ or $f_{obs} > F_{0.995;\,14,14} = 4.2993$.

> c(RRlower, RRupper)
[1] 0.2325967 4.2992869
> TR
    F test to compare two variances

data:  diameter by supplier
F = 4.312, num df = 14, denom df = 14, p-value = 0.009861
alternative hypothesis: true ratio of variances is not equal to 1
99 percent confidence interval:
  1.002958 18.538551
sample estimates:
ratio of variances 
          4.312006 

The value of the standardized test statistic is fobs = (0.065)/(0.0151) = 4.312.

Step 4: Statistical Conclusion — The ℘-value is $2 \times P(F_{14,14} \geq 4.312) = 0.0099$.
I. From the rejection region, reject H0 because fobs = 4.312 is greater than 4.2993.
II. From the ℘-value, reject H0 because the ℘-value = 0.0099 is less than 0.01.
Reject H0.

Step 5: English Conclusion — The evidence suggests the variability of the width of LED diodes from supplier A is not equal to the variance for the width of LED diodes from supplier B.
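For a two-sided F test the rejection region has two cutoffs, and when both samples have the same size the lower cutoff is simply the reciprocal of the upper one. The sketch below illustrates this with the degrees of freedom and the variance ratio reported above; the names alpha, df1, df2, and p_val are illustrative, and the two-sided ℘-value is computed as twice the smaller tail area.

```r
alpha <- 0.01
df1 <- 14; df2 <- 14

RRlower <- qf(alpha / 2,     df1, df2)
RRupper <- qf(1 - alpha / 2, df1, df2)

# With equal degrees of freedom the two cutoffs are reciprocals of each other
all.equal(RRlower, 1 / RRupper)

f_obs <- 4.312006   # ratio of variances from the output above

# Two-sided p-value: double the smaller tail area
p_val <- 2 * min(pf(f_obs, df1, df2), pf(f_obs, df1, df2, lower.tail = FALSE))

f_obs < RRlower || f_obs > RRupper   # TRUE: f_obs falls in the rejection region
```

The observed ratio exceeds the upper cutoff only narrowly, which agrees with a ℘-value just under 0.01.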
15. The technology at a certain computer manufacturing plant allows silicon sheets to be split into chips using two different techniques. In an effort to decide which technique is superior, 28 silicon sheets are randomly selected from the warehouse. The two techniques of splitting the chips are randomly assigned to the 28 sheets so that each technique is applied to 14 sheets. The results from the experiment are stored in the data frame CHIPS. Use α = 0.05, and test the appropriate hypothesis to see if there are differences between the two techniques. The values recorded in CHIPS are the number of usable chips from each silicon sheet.

Solution: To solve this problem, start by verifying the reasonableness of the normality assumption.

> ggplot(data = CHIPS, aes(number, fill = method)) +
+   geom_density(alpha = 0.3) +
+   theme_bw()
> ggplot(data = CHIPS, aes(sample = number, color = method)) +
+   stat_qq() +
+   theme_bw()
[Figure: density plots of number by method (techniqueI, techniqueII), and normal quantile-quantile plots of number (sample versus theoretical quantiles) colored by method.]
The density plots and normal quantile-quantile plots suggest it is reasonable to assume the number of usable chips from each technique follows a normal distribution; however, it is clear from the density plots that the variances are different. Now, proceed with the five-step procedure.

Step 1: Hypotheses — Since the problem asks whether there are differences in the mean number of usable chips generated by the two techniques, use a two-sided alternative hypothesis. $H_0: \mu_X - \mu_Y = 0$ versus $H_1: \mu_X - \mu_Y \neq 0$

Step 2: Test Statistic — The test statistic chosen is $\bar{X} - \bar{Y}$ because $E[\bar{X} - \bar{Y}] = \mu_X - \mu_Y$.

> MEANS
 techniqueI techniqueII 
   337.6429    360.0714 

The value of this test statistic is 337.6429 − 360.0714 = −22.4286. The standardized test statistic under the assumption that H0 is true and its approximate distribution are
$$\frac{(\bar{X} - \bar{Y}) - \delta_0}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \overset{\cdot}{\sim} t_{\nu}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately $t_{\nu}$, and H1 is a two-sided hypothesis, the rejection region is $t_{obs} < t_{0.025;\,18.4541} = -2.0972$ or $t_{obs} > t_{0.975;\,18.4541} = 2.0972$.

> TR
    Welch Two Sample t-test

data:  number by method
t = -1.1175, df = 18.454, p-value = 0.2781
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -64.52203  19.66489
sample estimates:
 mean in group techniqueI mean in group techniqueII 
                 337.6429                  360.0714 

> c(RRlower, RRupper)
[1] -2.097223  2.097223

The degrees of freedom are
$$\nu = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X - 1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 18.4541,$$
and the value of the standardized test statistic is
$$t_{obs} = \frac{(\bar{x} - \bar{y}) - \delta_0}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}} = -1.1175.$$

Step 4: Statistical Conclusion — The ℘-value is $2 \times P(t_{18.4541} \geq |-1.1175|) = 0.2781$.
I. From the rejection region, fail to reject H0 because tobs = −1.1175 lies between −2.0972 and 2.0972.
II. From the ℘-value, fail to reject H0 because the ℘-value = 0.2781 is greater than 0.05.
Fail to reject H0.

Step 5: English Conclusion — There is insufficient evidence to suggest the average number of usable chips from technique I is different from the average number of usable chips from technique II.
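For a two-sided test the statistic must be compared with both cutoffs, which is the same as comparing |tobs| with the upper one. A short sketch of that decision rule follows, using only the statistic and degrees of freedom reported above; the names alpha, crit, and reject are illustrative.

```r
# Values copied from the Welch t-test output above
t_obs <- -1.1175
nu    <- 18.4541
alpha <- 0.05

crit   <- qt(1 - alpha / 2, df = nu)     # symmetric cutoffs are -crit and +crit
reject <- abs(t_obs) > crit              # two-sided rule: |t_obs| > crit
p_val  <- 2 * pt(abs(t_obs), df = nu, lower.tail = FALSE)

c(crit = crit, p_val = p_val)
reject                                   # FALSE: fail to reject H0
```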
16. Phenylketonuria (PKU) is a genetic disorder that is characterized by an inability of the body to utilize an essential amino acid, phenylalanine. Research suggests patients with phenylketonuria have deficiencies in coenzyme Q10. The data frame PHENYL records the level of Q10 at four different times for 46 patients diagnosed with PKU. The variable Q10.1 contains the level of Q10 measured in µM for the 46 patients. Q10.2, Q10.3, and Q10.4 record the values recorded at later times, respectively, for the 46 patients (Artuch et al., 2004). (a) Normal patients have a Q10 reading of 0.69 µM. Using the variable Q10.2, is there evidence that the mean value of Q10 in patients diagnosed with PKU is less than 0.69 µM? (Use α = 0.01.)
(b) Patients diagnosed with PKU are placed on strict vegetarian diets. Some have speculated that patients diagnosed with PKU have low Q10 readings because meats are rich in Q10. Is there evidence that the patients’ Q10 level decreases over time? Construct a 99% confidence interval for the mean difference of the Q10 levels using Q10.1 and Q10.4.

Solution:
(a) To solve this problem, start by verifying the normality assumption of the data using exploratory data analysis (eda()).

> eda(Q10.2)

[Figure: EXPLORATORY DATA ANALYSIS of Q10.2. Panels: Histogram of Q10.2, Density of Q10.2, Boxplot of Q10.2, Q−Q Plot of Q10.2.]
The results from applying the function eda() to variable Q10.2 suggest it is not unreasonable to assume that Q10.2 follows a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if the mean value of Q10.2 is less than 0.69 µM, the hypotheses are $H_0: \mu = 0.69$ versus $H_1: \mu < 0.69$

Step 2: Test Statistic — The test statistic chosen is $\bar{X}$ because $E[\bar{X}] = \mu$.

> xbar
[1] 0.5165217

The value of this test statistic is $\bar{x} = 0.5165$. The standardized test statistic under the assumption that H0 is true and its distribution are
$$\frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{46-1}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{45}$, and H1 is a lower one-sided hypothesis, the rejection region is $t_{obs} < t_{0.01;\,45} = -2.4121$.

> RR
[1] -2.412116
> TR
    One Sample t-test

data:  Q10.2
t = -6.2405, df = 45, p-value = 6.856e-08
alternative hypothesis: true mean is less than 0.69
95 percent confidence interval:
      -Inf 0.5632078
sample estimates:
mean of x 
0.5165217 

The value of the standardized test statistic is
$$t_{obs} = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = -6.2405.$$

Step 4: Statistical Conclusion — The ℘-value is $P(t_{45} \leq -6.2405) = 6.856 \times 10^{-8} \approx 0$.
I. From the rejection region, reject H0 because tobs = −6.2405 is less than −2.4121.
II. From the ℘-value, reject H0 because the ℘-value ≈ 0 is less than 0.01.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest that the mean Q10.2 level is less than 0.69 µM.
(b) Note that the problem is solved by comparing the Q10.1 and Q10.4 values for each subject. Consequently, this question is answered using a paired t-test. Start the analysis by verifying the normality assumption required to use a paired t-test.

> eda(Diff)

[Figure: EXPLORATORY DATA ANALYSIS of Diff. Panels: Histogram of Diff, Density of Diff, Boxplot of Diff, Q−Q Plot of Diff.]
The results from applying the function eda() to the differences between Q10.1 and Q10.4 suggest it is not unreasonable to assume the Q10 differences between Q10.1 and Q10.4 follow a normal distribution. Now, proceed with the five-step procedure.

Step 1: Hypotheses — To test if the average difference between Q10.1 and Q10.4 is greater than zero, the hypotheses are $H_0: \mu_D = 0$ versus $H_1: \mu_D > 0$

Step 2: Test Statistic — The test statistic chosen is $\bar{D}$ because $E[\bar{D}] = \mu_D$.

> dbar
[1] 0.1215217

The value of this test statistic is $\bar{d} = 0.1215$. The standardized test statistic under the assumption that H0 is true and its distribution are
$$\frac{\bar{D} - \delta_0}{S_D/\sqrt{n_D}} \sim t_{46-1}.$$

Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed $t_{45}$ and H1 is an upper one-sided hypothesis, the rejection region is $t_{obs} > t_{0.99;\,45} = 2.4121$.

> RR
[1] 2.412116
> TR
    One Sample t-test

data:  Diff
t = 4.3372, df = 45, p-value = 4.02e-05
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
 0.07446716        Inf
sample estimates:
mean of x 
0.1215217 

> # Or
> t.test(PHENYL$Q10.1, PHENYL$Q10.4, paired = TRUE,
+        alternative = "greater")

    Paired t-test

data:  PHENYL$Q10.1 and PHENYL$Q10.4
t = 4.3372, df = 45, p-value = 4.02e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.07446716        Inf
sample estimates:
mean of the differences 
              0.1215217 

The value of the standardized test statistic is
$$t_{obs} = \frac{\bar{d} - \delta_0}{s_D/\sqrt{n_D}} = 4.3372.$$

Step 4: Statistical Conclusion — The ℘-value is $P(t_{45} \geq 4.3372) = 4.02 \times 10^{-5} \approx 0$.
I. From the rejection region, reject H0 because tobs = 4.3372 is greater than 2.4121.
II. From the ℘-value, reject H0 because the ℘-value ≈ 0 is less than 0.01.
Reject H0.

Step 5: English Conclusion — There is evidence to suggest Q10 levels decrease over time.
> CI
[1] 0.04616434 0.19687914
attr(,"conf.level")
[1] 0.99

The 99% confidence interval for the mean difference of the Q10 levels is [0.0462, 0.1969].
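The two-sided 99% interval can be rebuilt from the quantities already reported for the paired test. The sketch below is one way to do so and is not necessarily how CI was produced in the original solution: because δ0 = 0, the standard error of the mean difference is recovered as d̄/tobs, and the interval is d̄ ± t0.995; 45 times that standard error. The names d_bar, se, and crit are illustrative.

```r
d_bar <- 0.1215217        # mean of the differences (from the output above)
t_obs <- 4.3372           # paired t statistic (from the output above)
n_D   <- 46               # number of patients

se   <- d_bar / t_obs                   # s_D / sqrt(n_D), recovered since delta_0 = 0
crit <- qt(0.995, df = n_D - 1)         # a two-sided 99% interval uses t_{0.995; 45}
CI   <- d_bar + c(-1, 1) * crit * se

round(CI, 4)
```

Note that the one-sided test in Step 3 uses $t_{0.99;\,45}$, whereas the two-sided interval uses $t_{0.995;\,45}$, so the interval's lower limit differs from the one-sided bound a 99% one-sided interval would give.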
17. According to the Pamplona, Spain, registration, 0.4% of immigrants in 2002 were from Bolivia. In June of 2005, a sample of 3740 registered foreigners was randomly selected. Of these, 87 were Bolivians. Is there evidence to suggest immigration from Bolivia has increased? (Use α = 0.05.)

Solution: Use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether immigration from Bolivia has increased are $H_0: \pi = 0.004$ versus $H_1: \pi > 0.004$.

Step 2: Test Statistic — The test statistic chosen is Y, where Y is the number of Bolivian immigrants. Provided H0 is true, $Y \sim Bin(n, \pi_0)$. The value of the test statistic is yobs = 87.

Step 3: Rejection Region Calculations — Rejection is based on the ℘-value, so none are required.

Step 4: Statistical Conclusion — Likelihood Method:
$$℘\text{-value} = P(Y \geq y_{obs} \mid H_0) = \sum_{i=y_{obs}}^{n} \binom{n}{i} \pi_0^{i} (1 - \pi_0)^{n-i} = \sum_{i=87}^{3740} \binom{3740}{i} 0.004^{i} (0.996)^{3740-i} \approx 0 \quad \text{(computed with R)}$$

> pvalue
[1] 1.505911e-37
> TR

    Exact binomial test

data:  87 and 3740
number of successes = 87, number of trials = 3740, p-value < 2.2e-16
alternative hypothesis: true probability of success is greater than 0.004
95 percent confidence interval:
 0.01935316 1.00000000
sample estimates:
probability of success 
            0.02326203 
Reject H0 . Step 5: English Conclusion — There is evidence to suggest the proportion of Bolivian immigrants in Pamplona, Spain, has increased.
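The binomial tail probability in Step 4 can be evaluated directly. The sketch below shows two equivalent ways, summing the probability mass function as in the displayed formula and calling the upper tail of pbinom(); the object names are illustrative, and these calls are an assumption about how such a value may be obtained rather than a transcript of the original code. The expected count under H0 is included to show how far the observed 87 lies from it.

```r
n     <- 3740
y_obs <- 87
pi0   <- 0.004

# Direct evaluation of the sum in Step 4
pvalue <- sum(dbinom(y_obs:n, size = n, prob = pi0))

# Equivalent upper-tail call: P(Y >= y_obs) = P(Y > y_obs - 1)
pvalue_tail <- pbinom(y_obs - 1, size = n, prob = pi0, lower.tail = FALSE)

# Expected count under H0, for comparison with the observed 87
expected <- n * pi0

c(pvalue = pvalue, pvalue_tail = pvalue_tail, expected = expected)
```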
18. Find the power for the hypothesis H0 : µ = 65 versus H1 : µ > 65 if µ1 = 70 at the α = 0.01 level assuming σ = s for the variable hard in the data frame Rubber of the MASS package.

Solution:

> library(MASS)
> power
[1] 0.4277995

The power for the hypothesis H0 : µ = 65 versus H1 : µ > 65 if µ1 = 70 at the α = 0.01 level assuming σ = 12.1767 for the variable hard in the data frame Rubber of the MASS package is 0.4278.

19. The director of urban housing in Vitoria, Spain, claims that at least 50% of all apartments have more than one bathroom and that at least 75% of all apartments have an elevator.
(a) Can the director’s claim about bathrooms be contradicted? Test the appropriate hypothesis using α = 0.10. Note that the number of bathrooms is stored in the variable toilets in the data frame VIT2005.
(b) Can the director’s claim about elevators be substantiated using an α level of 0.10? Use both an approximate method as well as an exact method to reach a conclusion. Are the methods in agreement?
(c) Test whether the proportion of apartments built prior to 1980 without garages have a smaller proportion with elevators than without elevators.

Solution:
(a) Use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to contradict the housing director’s claim about the bathrooms are $H_0: \pi = 0.50$ versus $H_1: \pi < 0.50$.

Step 2: Test Statistic — The test statistic chosen is Y, where Y is the number of apartments that have more than one bathroom. Provided H0 is true, $Y \sim Bin(n, \pi_0)$.

> FT
toilets
  1   2 
116 102 
> c(yobs, n)
  2     
102 218 

The value of the test statistic is yobs = 102.

Step 3: Rejection Region Calculations — Rejection is based on the ℘-value, so none are required.

Step 4: Statistical Conclusion — Likelihood Method:
$$℘\text{-value} = P(Y \leq y_{obs} \mid H_0) = \sum_{i=0}^{y_{obs}} \binom{n}{i} \pi_0^{i} (1 - \pi_0)^{n-i} = \sum_{i=0}^{102} \binom{218}{i} 0.5^{i} (0.5)^{218-i} = 0.1893 \quad \text{(computed with R)}$$

> pvalue
[1] 0.1893231
> TR

    Exact binomial test

data:  yobs and n
number of successes = 102, number of trials = 218, p-value = 0.1893
alternative hypothesis: true probability of success is less than 0.5
95 percent confidence interval:
 0.000000 0.525842
sample estimates:
probability of success 
             0.4678899 

Thus, one fails to reject H0 because 0.1893 is greater than 0.1. Fail to reject H0.

Step 5: English Conclusion — There is not sufficient evidence to contradict the claim that at least 50% of all apartments have more than one bathroom.

(b) Use the five-step procedure.

Exact method:

Step 1: Hypotheses — The null and alternative hypotheses to support the housing director’s claim about elevators are $H_0: \pi = 0.75$ versus $H_1: \pi > 0.75$.

Step 2: Test Statistic — The test statistic chosen is Y, where Y is the number of apartments that have an elevator.

> FT
elevator
  0   1 
 44 174 
> c(yobs, n)
  1     
174 218 

Provided H0 is true, $Y \sim Bin(n, \pi_0)$. The value of the test statistic is yobs = 174.

Step 3: Rejection Region Calculations — Rejection is based on the ℘-value, so none are required.
Step 4: Statistical Conclusion — Likelihood Method:
$$℘\text{-value} = P(Y \geq y_{obs} \mid H_0) = \sum_{i=y_{obs}}^{n} \binom{n}{i} \pi_0^{i} (1 - \pi_0)^{n-i} = \sum_{i=174}^{218} \binom{218}{i} 0.75^{i} (0.25)^{218-i} = 0.0564 \quad \text{(computed with R)}$$

> pvalue
[1] 0.05643458
> TR

    Exact binomial test

data:  yobs and n
number of successes = 174, number of trials = 218, p-value = 0.05643
alternative hypothesis: true probability of success is greater than 0.75
95 percent confidence interval:
 0.748218 1.000000
sample estimates:
probability of success 
             0.7981651 

Thus, one rejects H0 because 0.0564 is less than 0.1. Reject H0.

Step 5: English Conclusion — There is evidence to substantiate the claim of the housing director regarding elevators.

Approximate method: Use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to substantiate the housing director’s claim regarding elevators are $H_0: \pi = 0.75$ versus $H_1: \pi > 0.75$.

Step 2: Test Statistic — The test statistic chosen is P, where P is the proportion of apartments with elevators. Provided H0 is true,
$$P \overset{\cdot}{\sim} N\left(\pi_0, \sqrt{\frac{\pi_0 (1 - \pi_0)}{n}}\right)$$
K14521_SM-Color_Cover.indd 394
30/06/15 11:49 am
Chapter 9:
Hypothesis Testing
389
and the standardized test statistic is P − π0 Z=
π0 (1−π0 ) n
∼ N (0, 1).
Step 3: Rejection Region Calculations — Because the standardized test statistic has an approximate N (0, 1) distribution, and H1 is an upper one-sided hypothesis, the rejection region is zobs > z0.9 = 1.2816. > RR RR [1] 1.281552 > TRnoCC TRwiCC TRnoCC
= = = =
n, p = 0.75, "greater", correct = FALSE) n, p = 0.75, "greater", correct = TRUE)
1-sample proportions test without continuity correction data: yobs out of n, null probability 0.75 X-squared = 2.6972, df = 1, p-value = 0.05026 alternative hypothesis: true p is greater than 0.75 95 percent confidence interval: 0.7499209 1.0000000 sample estimates: p 0.7981651 > TRwiCC
1-sample proportions test with continuity correction data: yobs out of n, null probability 0.75 X-squared = 2.4465, df = 1, p-value = 0.05889 alternative hypothesis: true p is greater than 0.75 95 percent confidence interval: 0.7474708 1.0000000 sample estimates: p 0.7981651
The value of the standardized test statistic is

Without Continuity Correction

$$z_{obs} = \frac{p - \pi_0}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}} = \frac{\frac{174}{218} - 0.75}{\sqrt{\dfrac{(0.75)(1-0.75)}{218}}} = 1.6423$$

With Continuity Correction (because p > π0, the correction 1/(2n) is subtracted)

$$z_{obs} = \frac{p - \pi_0 - \frac{1}{2n}}{\sqrt{\dfrac{\pi_0(1-\pi_0)}{n}}} = \frac{\frac{174}{218} - 0.75 - \frac{1}{436}}{\sqrt{\dfrac{(0.75)(1-0.75)}{218}}} = 1.5641$$
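The two standardized statistics above can be checked by hand. The following is a minimal sketch, assuming only the counts given in the exercise (174 successes in 218 trials, π0 = 0.75); the object names `p_hat`, `se0`, `z_no_cc`, and `z_cc` are introduced here for illustration and are not taken from the manual's code.

```r
# Observed data from the exercise: 174 of 218 apartments have elevators
yobs <- 174
n    <- 218
pi0  <- 0.75

p_hat <- yobs / n
se0   <- sqrt(pi0 * (1 - pi0) / n)            # standard error under H0

z_no_cc <- (p_hat - pi0) / se0                # no continuity correction
z_cc    <- (p_hat - pi0 - 1 / (2 * n)) / se0  # correction moves p_hat toward pi0

# Upper-tail p-values, since H1 is pi > 0.75
p_no_cc <- pnorm(z_no_cc, lower.tail = FALSE)
p_cc    <- pnorm(z_cc, lower.tail = FALSE)

round(c(z_no_cc = z_no_cc, z_cc = z_cc, p_no_cc = p_no_cc, p_cc = p_cc), 4)
```

Squaring either z value should reproduce the corresponding X-squared reported by `prop.test()`.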
Step 4: Statistical Conclusion — The ℘-value is P(Z ≥ 1.6423) = 0.0503 or P(Z ≥ 1.5641) = 0.0589 for continuity corrections not used and used, respectively.

I. From the rejection region, reject H0 because zobs = 1.6423 (no continuity correction) is greater than 1.2816, and zobs = 1.5641 (continuity correction) is greater than 1.2816.

II. From the ℘-value, reject H0 because the ℘-value = 0.0503 (without continuity correction) or ℘-value = 0.0589 (with continuity correction) is less than 0.1.

Reject H0.

Step 5: English Conclusion — There is evidence to support the housing director’s claim about the percent of apartments with elevators.

(c) To solve this problem, use Fisher’s exact test and the five-step procedure. Only Fisher’s Exact Test will be completed to answer the question as the n(1 − π) > 10 condition will not be satisfied for a large sample approximation.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the proportion of apartments built prior to 1980 with elevators that do not have garages is less than the proportion of apartments that do not have elevators or garages are H0 : πX = πY versus H1 : πX > πY. In this case, the random variable X will represent the number of apartments built prior to 1980 that do not have garages or elevators, and the random variable Y will represent the number of apartments built prior to 1980 that have an elevator but no garage.

Step 2: Test Statistic — The test statistic chosen is X, where X is the number of apartments built prior to 1980 that do not have garages or elevators.

> FT
        garage
elevator  0  1
       0 19  0
       1 22  4

Table 9.1: Apartments built prior to 1980 classified by the presence of a garage and of an elevator

                     Garage
Elevators     NO         YES          Total
NO            19 = x     0            19 = m
YES           22         4            26 = n
Total         41 = k     4 = N − k    45 = N
The observed value of the test statistic is x = 19. Provided H0 is true, and conditioning on the fact that X + Y = k, X ∼ Hyper(m, n, k).

Step 3: Rejection Region Calculations — Rejection is based on the ℘-value, so none are required.

Step 4: Statistical Conclusion — To compute the ℘-value, compute

$$P(X \ge x \mid H_0) = \sum_{i=x}^{\min\{m,k\}} \frac{\binom{m}{i}\binom{n}{k-i}}{\binom{N}{k}} = \sum_{i=19}^{\min\{19,41\}} \frac{\binom{19}{i}\binom{26}{41-i}}{\binom{45}{41}} = 0.1003$$

> pvalue
[1] 0.1003389
> TR
	Fisher's Exact Test for Count Data

data:  FT
p-value = 0.1003
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.6878007       Inf
sample estimates:
odds ratio
       Inf

Since the ℘-value is 0.1003, one fails to reject H0 because 0.1003 is greater than 0.10. Fail to reject H0.

Step 5: English Conclusion — There is not sufficient evidence to suggest that the proportion of apartments built prior to 1980 with elevators and no garages is lower than the proportion of apartments built prior to 1980 without elevators and no garages.
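The ℘-value in part (c) can be reproduced directly from the counts in Table 9.1. The sketch below is illustrative only: it rebuilds the 2 × 2 table by hand instead of deriving it from VIT2005, and the names `p_hyper` and `p_fisher` are introduced here, not taken from the manual.

```r
# Counts from Table 9.1: rows = elevator (no, yes), columns = garage (no, yes)
FT <- matrix(c(19, 22, 0, 4), nrow = 2,
             dimnames = list(elevator = c("no", "yes"),
                             garage   = c("no", "yes")))

m <- 19   # apartments without an elevator
n <- 26   # apartments with an elevator
k <- 41   # apartments without a garage
x <- 19   # observed count: no elevator and no garage

# P(X >= x) for X ~ Hyper(m, n, k), summed over the upper tail
p_hyper <- sum(dhyper(x:min(m, k), m, n, k))

# The same one-sided p-value from Fisher's exact test
p_fisher <- fisher.test(FT, alternative = "greater")$p.value

c(p_hyper = p_hyper, p_fisher = p_fisher)
```

Because x already equals min{m, k}, the sum contains a single term, which is why the hypergeometric sum and Fisher's test agree so simply here.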
20. A rule of thumb used by realtors in Vitoria, Spain, is that each square meter will cost roughly €3000; however, there is some suspicion that this figure is high for apartments in the 55 to 66 m² range. Use a 5 m² bracket, that is, [55, 60) (small) and [60, 65) (medium), to see if evidence exists that the average difference in price between the medium and small apartment sizes is less than €15,000.

(a) Use the data frame VIT2005 and the variables totalprice and area to test the appropriate hypothesis at a 5% significance level.
(b) Are the assumptions for using a t-test satisfied? Explain.
(c) Does the answer for (a) differ if the variances are assumed to be equal? Can the hypothesis of equal variances be rejected?

Solution: (a) To solve this problem, start by creating a variable aptsize. Then, verify the reasonableness of the normality assumption for the two apartment sizes.

> ggplot(data = VITsub, aes(totalprice, fill = aptsize)) +
+   geom_density(alpha = 0.2) +
+   theme_bw()
> ggplot(data = VITsub, aes(sample = totalprice, color = aptsize)) +
+   stat_qq() +
+   theme_bw()
> c(xbar, ybar)
[1] 205083.3 191531.5

The value of this test statistic is 205083.3333 − 191531.5287 = 13551.8046. The standardized test statistic under the assumption that H0 is true and its approximate distribution are

$$\frac{\bar{X} - \bar{Y} - \delta_0}{\sqrt{\dfrac{S_X^2}{n_X} + \dfrac{S_Y^2}{n_Y}}} \,\dot{\sim}\, t_\nu.$$
Step 3: Rejection Region Calculations — Because the standardized test statistic is distributed approximately tν, and H1 is a lower one-sided hypothesis, the rejection region is tobs < t0.05; 16.0129 = −1.7458.

> TR

	Welch Two Sample t-test

data:  MediumAptPrice and SmallAptPrice
t = -0.16284, df = 16.013, p-value = 0.4363
alternative hypothesis: true difference in means is less than 15000
95 percent confidence interval:
     -Inf 29077.34
sample estimates:
mean of x mean of y
 205083.3  191531.5

> RR
[1] -1.745797

The degrees of freedom are

$$\nu = \frac{\left(\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}\right)^2}{\dfrac{(s_X^2/n_X)^2}{n_X - 1} + \dfrac{(s_Y^2/n_Y)^2}{n_Y - 1}} = 16.0129,$$

and the value of the standardized test statistic is

$$t_{obs} = \frac{\bar{x} - \bar{y} - \delta_0}{\sqrt{\dfrac{s_X^2}{n_X} + \dfrac{s_Y^2}{n_Y}}} = -0.1628.$$

Step 4: Statistical Conclusion — The ℘-value is P(t16.0129 ≤ −0.1628) = 0.4363.

I. From the rejection region, fail to reject H0 because tobs = −0.1628 is greater than −1.7458.

II. From the ℘-value, fail to reject H0 because the ℘-value = 0.4363 is greater than 0.05.

Fail to reject H0.

Step 5: English Conclusion — There is not sufficient evidence to suggest the average totalprice between the medium and small apartment sizes is less than €15,000.
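The Welch degrees of freedom depend on the two sample variances only through their ratio, so the value 16.0129 can be checked without the raw prices. The sketch below assumes the variance ratio (1.175613) and the sample sizes (12 medium, 8 small apartments) reported by the F test in part (c); the helper `welch_df()` is a name introduced here for illustration.

```r
# Welch-Satterthwaite degrees of freedom from sample variances and sizes
welch_df <- function(s2x, nx, s2y, ny) {
  vx <- s2x / nx
  vy <- s2y / ny
  (vx + vy)^2 / (vx^2 / (nx - 1) + vy^2 / (ny - 1))
}

# Only the variance ratio matters, so set s2y = 1 and s2x = ratio.
# Ratio and sample sizes are taken from the var.test() output in part (c).
nu <- welch_df(s2x = 1.175613, nx = 12, s2y = 1, ny = 8)
nu

# Lower one-sided critical value at alpha = 0.05
qt(0.05, df = nu)
```

The result always lies between min(nX, nY) − 1 and nX + nY − 2, which is a quick sanity check on any hand computation.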
(b) From the description of the data, it is not clear if they were obtained as a random sample of all apartments in Vitoria; however, it is reasonable to assume the distribution of totalprice for both medium and small apartments follows a normal distribution.
(c)

> TR

	Two Sample t-test

data:  MediumAptPrice and SmallAptPrice
t = -0.1601, df = 18, p-value = 0.4373
alternative hypothesis: true difference in means is less than 15000
95 percent confidence interval:
     -Inf 29237.82
sample estimates:
mean of x mean of y
 205083.3  191531.5

The conclusion for (a) is the same (fail to reject H0) if the variances are assumed to be equal.

> var.test(MediumAptPrice, SmallAptPrice)

	F test to compare two variances

data:  MediumAptPrice and SmallAptPrice
F = 1.1756, num df = 11, denom df = 7, p-value = 0.8594
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2496275 4.4187049
sample estimates:
ratio of variances
          1.175613

The hypothesis of equal variance cannot be rejected at the α = 0.05 level.

21. A survey to determine unemployment demographics was administered during the first trimester of 2005 in the Spanish province of Navarra. The numbers of unemployed people according to urban and rural areas and gender follow.

Unemployment in Navarra, Spain, in 2005

           Male    Female    Totals
Urban      4734      6161     10895
Rural      3259      4033      7292
Totals     7993     10194     18187

(a) Test to see if there is evidence to suggest that πmale|urban < πfemale|urban at α = 0.05.
(b) Use an exact test to see if the evidence suggests πfemale|urban > 0.55.
(c) Is there evidence to suggest the unemployment rate for females given that they live in a rural area is greater than 50%? Use α = 0.05 with an exact test to reach a conclusion.
(d) Does evidence suggest that πfemale|urban > πfemale|rural?

Solution: (a) To solve this problem, use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test if there is evidence to suggest that πmale|urban < πfemale|urban are H0 : πX = πY versus H1 : πX < πY. In this case, let the random variable X represent the number of males given an urban area, and let the random variable Y represent the number of females given an urban area.

Step 2: Test Statistic — The test statistic chosen is PX − PY since E[PX − PY] = πX − πY. The standardized test statistic under the assumption that H0 is true is

$$Z = \frac{P_X - P_Y}{\sqrt{P(1-P)\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$

Step 3: Rejection Region Calculations — Because the standardized test statistic has an approximate N(0, 1) distribution and H1 is a lower one-sided hypothesis, the rejection region is zobs < z0.05 = −1.6449.

> RR
[1] -1.644854

> TR

	Exact binomial test

data:  6161 and 10895
number of successes = 6161, number of trials = 10895, p-value = 0.0005906
alternative hypothesis: true probability of success is greater than 0.55
95 percent confidence interval:
 0.5576192 1.0000000
sample estimates:
probability of success
             0.5654888

Since the ℘-value is 6e-04, reject H0. Reject H0.

Step 5: English Conclusion — There is sufficient evidence to suggest the proportion of unemployed females given an urban area is greater than 55%.

(c) Use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test whether the proportion of unemployed females given a rural area is greater than 50% are H0 : π = 0.50 versus H1 : π > 0.50.

Step 2: Test Statistic — The test statistic chosen is Y, where Y is the number of unemployed females given a rural area. Provided H0 is true, Y ∼ Bin(n, π0). The value of the test statistic is yobs = 4033.

Step 3: Rejection Region Calculations — Rejection is based on the ℘-value, so none are required.

Step 4: Statistical Conclusion — Likelihood Method:

> pvalue
[1] 6.48379e-20
> TR

	Exact binomial test

data:  4033 and 7292
number of successes = 4033, number of trials = 7292, p-value < 2.2e-16
alternative hypothesis: true probability of success is greater than 0.5
95 percent confidence interval:
 0.5434122 1.0000000
sample estimates:
probability of success
             0.5530719

Since the ℘-value is less than 2.2e-16, reject H0. Reject H0.

Step 5: English Conclusion — There is sufficient evidence to suggest the proportion of unemployed females given a rural area is greater than 50%.

(d) To solve this problem, use the five-step procedure.

Step 1: Hypotheses — The null and alternative hypotheses to test if there is evidence to suggest that πfemale|urban > πfemale|rural are H0 : πX = πY versus H1 : πX > πY. In this case, let the random variable X represent the number of females given an urban area, and let the random variable Y represent the number of females given a rural area.

Step 2: Test Statistic — The test statistic chosen is PX − PY since E[PX − PY] = πX − πY. The standardized test statistic under the assumption that H0 is true is

$$Z = \frac{P_X - P_Y}{\sqrt{P(1-P)\left(\dfrac{1}{m} + \dfrac{1}{n}\right)}}$$

Step 3: Rejection Region Calculations — Because the standardized test statistic has an approximate N(0, 1) distribution and H1 is an upper one-sided hypothesis, the rejection region is zobs > z0.95 = 1.6449.
> RR
[1] -1.76131
> TR

	One Sample t-test

data:  Diff
t = -2.5096, df = 14, p-value = 0.0125
alternative hypothesis: true mean is less than 0
95 percent confidence interval:
      -Inf -9.214292
sample estimates:
mean of x
-30.90333

The value of the standardized test statistic is

$$t_{obs} = \frac{\bar{d} - \delta_0}{s_D/\sqrt{n_D}} = \frac{-30.9033 - 0}{47.6925/\sqrt{15}} = -2.5096.$$

Step 4: Statistical Conclusion — The ℘-value is P(t14 ≤ −2.5096) = 0.0125.

I. From the rejection region, reject H0 because tobs = −2.5096 is less than −1.7613.

II. From the ℘-value, reject H0 because the ℘-value = 0.0125 is less than 0.05.

Reject H0.

Step 5: English Conclusion — There is evidence to suggest that the mean difference between prices for insurance quotes from company A and company B is less than zero. That is, for similar insurance, company A is less expensive than company B.
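The paired t statistic, rejection region, and ℘-value can be recomputed from the three summary numbers quoted above. This is a small sketch that assumes only those summaries (mean difference −30.9033, standard deviation 47.6925, 15 pairs); the object names are chosen here for illustration.

```r
# Summary statistics of the paired differences, as quoted in the solution
d_bar  <- -30.9033
s_d    <- 47.6925
n_d    <- 15
delta0 <- 0

t_obs   <- (d_bar - delta0) / (s_d / sqrt(n_d))  # standardized test statistic
crit    <- qt(0.05, df = n_d - 1)                # lower one-sided critical value
p_value <- pt(t_obs, df = n_d - 1)               # P(t_14 <= t_obs)

c(t_obs = t_obs, crit = crit, p_value = p_value)
t_obs < crit   # TRUE means the statistic falls in the rejection region
```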
The owner of the transportation fleet changed his mind when presented with quotes for the same jobs from each insurance company. In the original 100 jobs, the overall pattern of the companies’ insuring could be seen, but a comparison of similar jobs was not clear. With the paired data, a more reasonable comparison could be made.

23. Environmental monitoring is done in many fashions, including tracking levels of different chemicals in the air, underground water, soil, fish, milk, and so on. It is believed that milk cows grazing in pastures where gamma radiation from iodine exceeds 0.3 µGy/h produce milk with iodine concentrations in excess of 3.7 MBq/m³. Assuming the distribution of iodine in pastures follows a normal distribution with a standard deviation of 0.015 µGy/h, determine the required sample size to detect a 2% increase in baseline gamma radiation (0.3 µGy/h) using an α = 0.05 significance level with probability 0.99 or more.

Solution: The null and alternative hypotheses to test whether gamma radiation from iodine exceeds 0.3 µGy/h are H0 : µ = 0.3 versus H1 : µ > 0.3. The power of a test is the probability that the null hypothesis is rejected when it is false. Here,

$$\text{Power}(\mu_1 = 0.3 \times 1.02) = P(\text{reject } H_0 \mid \mu_1 = 0.3 \times 1.02) = P\left(\bar{X} > \text{95th percentile of a } N\left(0.3, \frac{0.015}{\sqrt{n}}\right) \,\middle|\, \mu_1 = 0.306\right).$$

R is used in an iterative process to discover the value of n = 99:

$$P\left(\bar{X} > \text{95th percentile of a } N\left(0.3, \frac{0.015}{\sqrt{99}}\right) \,\middle|\, \mu_1 = 0.306\right) = 0.9902.$$
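The iterative search for n can be sketched as follows. This is an illustrative reconstruction, not the manual's own code: the helper `power_at()` and the simple `while` loop are assumptions made here, and only the constants (µ0 = 0.3, a 2% increase, σ = 0.015, α = 0.05, target power 0.99) come from the exercise.

```r
alpha <- 0.05
mu0   <- 0.3
mu1   <- 0.3 * 1.02   # 2% increase over the baseline
sigma <- 0.015

# Power of the upper one-sided z-test for a sample of size n
power_at <- function(n) {
  se   <- sigma / sqrt(n)
  crit <- qnorm(1 - alpha, mean = mu0, sd = se)      # 95th percentile under H0
  pnorm(crit, mean = mu1, sd = se, lower.tail = FALSE)
}

# Increase n until the power first reaches 0.99
n <- 1
while (power_at(n) < 0.99) {
  n <- n + 1
}

n
power_at(n)
```

A closed-form alternative is n ≥ ((z0.95 + z0.99) σ / (µ1 − µ0))², rounded up, which gives the same answer for these inputs.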
> pvalue
[1] 0.5
The statistic s = 6 and the ℘-value is 0.5. There is not sufficient evidence to suggest the median satisfaction score is greater than 7.

7. A Mendebaldea real estate agent claims Mendebaldea, Spain, has larger apartments than those in San Jorge, Spain. A San Jorge real estate agent disputes this claim. To resolve the issue, two random samples of the total area of several apartments (given in m²) are taken from each community in 2002 and stored in the data frame APTSIZE.

Mendebaldea    90   92   90   83   85   105   136
San Jorge      75   75   53   78   52    90    78   75
(a) Is there evidence to support the Mendebaldea agent’s claim?
  (i) Use an exact procedure.
  (ii) Use an approximate procedure.
(b) Find a confidence interval for the median of Mendebaldea minus the median of San Jorge with a confidence level of at least 0.90.

Solution: (a)

> ggplot(data = APTSIZE) +
+   geom_density(aes(x = size, fill = location), alpha = 0.3) +
+   theme_bw()

[Figure: density plots of size by location (Mendebaldea, SanJorge); x-axis: size, y-axis: density.]
Based on the densities, the distributional shapes and skews for the two apartments appear to be different; however, due to the small sample sizes (7 and 8), it is very hard to reject the null
Chapter 10:
Nonparametric Methods
hypothesis that the two cdfs are the same. Consequently, one might assume the distributions are similar and proceed with a Wilcoxon rank-sum test procedure. An alternative approach might be to perform a permutation test.

(i) Exact procedure:

> library(coin)
Loading required package: survival
Attaching package: ’coin’
The following object is masked by ’.GlobalEnv’: alpha
> wilcox_test(size ~ location, data = APTSIZE, distribution = "exact",
+             alternative = "greater")

	Exact Wilcoxon Mann-Whitney Rank Sum Test

data:  size by location (Mendebaldea, SanJorge)
Z = 2.9167, p-value = 0.001088
alternative hypothesis: true mu is greater than 0

> obsDiff
Mendebaldea
   25.28571
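The permutation approach mentioned above can be sketched with the apartment sizes listed in the exercise. This is a hedged illustration, not the manual's code: the data are typed in by hand instead of read from APTSIZE, the seed and the number of resamples (`sims`) are arbitrary choices, and the difference in means is used as the test statistic because the observed value 25.28571 matches that difference.

```r
mendebaldea <- c(90, 92, 90, 83, 85, 105, 136)
san_jorge   <- c(75, 75, 53, 78, 52, 90, 78, 75)

obs_diff <- mean(mendebaldea) - mean(san_jorge)

set.seed(1)     # arbitrary seed, for reproducibility only
sims   <- 9999  # number of random relabelings (an assumption)
pooled <- c(mendebaldea, san_jorge)
n_m    <- length(mendebaldea)

# Randomly reassign the location labels and recompute the difference in means
perm_diff <- replicate(sims, {
  idx <- sample(length(pooled), n_m)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Upper one-sided p-value; the +1 terms count the observed arrangement
p_value <- (sum(perm_diff >= obs_diff) + 1) / (sims + 1)

obs_diff
p_value
```

With only choose(15, 7) = 6435 possible relabelings, full enumeration would also be feasible; random resampling is used here only to keep the sketch short.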
> ggplot(data = USJudgeRatings, aes(x = (INTG - DMNR))) +
+   geom_density(fill = "pink") +
+   theme_bw()

[Figure: density plot of (INTG − DMNR); x-axis: (INTG − DMNR), y-axis: density.]

Since the density plot of INTG − DMNR is skewed to the right, use the sign test to test if lawyers are more likely to give a judge high integrity ratings rather than high demeanor ratings.

> SIGN.test(Dif, md = 0, alternative = "greater")

	One-sample Sign-Test

data:  Dif
s = 41, p-value = 4.552e-13
alternative hypothesis: true median is greater than 0
95 percent confidence interval:
 0.2563989       Inf
sample estimates:
median of x
        0.4

                  Conf.Level L.E.pt U.E.pt
Lower Achieved CI     0.9369 0.3000    Inf
Interpolated CI       0.9500 0.2564    Inf
Upper Achieved CI     0.9670 0.2000    Inf

Based on the small ℘-value, reject the null hypothesis. The evidence suggests lawyers are more likely to give a judge high integrity ratings rather than high demeanor ratings.

(b)
> SIGN.test(Dif, md = 0, conf.level = 0.90)

	One-sample Sign-Test

data:  Dif
s = 41, p-value = 9.104e-13
alternative hypothesis: true median is not equal to 0
90 percent confidence interval:
 0.2563989 0.4436011
sample estimates:
median of x
        0.4

                  Conf.Level L.E.pt U.E.pt
Lower Achieved CI     0.8737 0.3000 0.4000
Interpolated CI       0.9000 0.2564 0.4436
Upper Achieved CI     0.9340 0.2000 0.5000

An interpolated 90% confidence interval for the median differences (INTG − DMNR) is [0.2564, 0.4436].

10. A company manager is studying the possibility of giving 20 minutes of rest to her employees in a resting room. To check the viability of this proposal, she analyzed 12 random days of productivity where employees took 20 minutes of rest and 12 random days where they did not. The groups’ productivity scores are given in the following table where higher scores represent greater productivity.
With Rest       9   8   8   7   6   7   8   9   7   7   7   6
Without Rest    7   9   5   6   7   3   9   9   4   5   6   4

Is there evidence to suggest that taking a rest produces an increase in group productivity? Answer based on the results from a (a) Wilcoxon signed-rank test, (b) t-test, and a (c) Permutation test.

Solution: The function eda() is used on the productivity differences.

> obsMdif
[1] 1.25
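The three requested tests can be sketched from the table above. This is an illustration under stated assumptions rather than the manual's code: the scores are entered by hand, the columns are treated as pairs (as the solution's use of differences and a signed-rank test implies), the permutation test flips the signs of the differences at random, and the seed and number of resamples are arbitrary.

```r
with_rest    <- c(9, 8, 8, 7, 6, 7, 8, 9, 7, 7, 7, 6)
without_rest <- c(7, 9, 5, 6, 7, 3, 9, 9, 4, 5, 6, 4)

d <- with_rest - without_rest
obs_mean_diff <- mean(d)

# (a) Wilcoxon signed-rank test; exact = FALSE because of ties and a zero
w_res <- wilcox.test(d, alternative = "greater", exact = FALSE)

# (b) One-sample t-test on the differences
t_res <- t.test(d, alternative = "greater")

# (c) Sign-flip permutation test on the mean difference
set.seed(1)     # arbitrary seed
sims <- 9999    # number of resamples (an assumption)
perm_means <- replicate(sims, {
  signs <- sample(c(-1, 1), length(d), replace = TRUE)
  mean(signs * d)
})
p_perm <- (sum(perm_means >= obs_mean_diff) + 1) / (sims + 1)

obs_mean_diff
c(wilcoxon = w_res$p.value, t_test = t_res$p.value, permutation = p_perm)
```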
> library(coin)
> oneway_test(Time ~ Company, distribution = "exact",
+             alternative = "less", data = DF)

	Exact 2-Sample Permutation Test

data:  Time by Company (American, Japanese)
Z = -1.7756, p-value = 0.04365
alternative hypothesis: true mu is less than 0

Note that the ℘-values for (a) and (b) agree.

(c)

> library(boot)
Attaching package: ’boot’
The following object is masked from ’package:survival’: aml
The following object is masked from ’package:lattice’: melanoma

> wilcox.test(extra ~ group, paired = TRUE, data = sleep, correct = FALSE)
Warning in wilcox.test.default(x = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, : cannot compute exact p-value with ties
Warning in wilcox.test.default(x = c(0.7, -1.6, -0.2, -1.2, -0.1, 3.4, 3.7, : cannot compute exact p-value with zeroes

	Wilcoxon signed rank test

data:  extra by group
V = 0, p-value = 0.007632
alternative hypothesis: true location shift is not equal to 0

Based on the small ℘-value (0.0076), reject the null hypothesis. The evidence suggests the mean difference between the drugs is not zero.

(b)

> obsMdif
[1] -1.58

> ggplot(data = CIRCUIT, aes(sample = lifetime, shape = design,
+                            color = design)) +
+   stat_qq() +
+   theme_bw()
[Figure: density plots of lifetime for Design1, Design2, Design3, and Design4 (x-axis: lifetime), and normal quantile-quantile plots of lifetime by design (x-axis: theoretical, y-axis: sample).]
The density plots and quantile-quantile normal plots make normality questionable; however, ruling out normality with so few observations is difficult.

(b)

> TR

	Kruskal-Wallis rank sum test

data:  lifetime by design
Kruskal-Wallis chi-squared = 10.245, df = 3, p-value = 0.0166

The ℘-value from the Kruskal-Wallis test of 0.0166 suggests differences exist among the mean lifetimes of different circuit designs.

(c)

> library(MASS)
> ggplot(data = airquality, aes(sample = Ozone, shape = Month,
+                               color = Month)) +
+   stat_qq() +
+   theme_bw()

[Figure: normal quantile-quantile plots of Ozone by Month (5, 6, 7, 8, 9); x-axis: theoretical, y-axis: sample.]
The curvature in the Q-Q normal plots, in addition to the density plots and boxplots, suggests the distribution of ozone for each month is not normal.

(d)

> TR

	Kruskal-Wallis rank sum test

data:  Ozone by Month
Kruskal-Wallis chi-squared = 29.267, df = 4, p-value = 6.901e-06

The Kruskal-Wallis test ℘-value of 6.901e-06 suggests the mean ozone level is not the same for all the months; however, the assumption of similar shapes for the distribution of ozone for each month was questionable.

(e)

> TR

	Pearson's Chi-squared test

data:  FT
X-squared = 28.102, df = 2, p-value = 7.901e-07

The ℘-value of 7.901e-07 suggests there is an association between wool and tension.

18. The music industry wants to know if the musical style on a CD influences how many illegal copies of it are sold. To achieve this purpose, the company chooses six cities randomly and writes down the number of illegal CDs available on the street categorized by music type: classical music, flamenco, heavy metal, and pop-rock. The data are shown in the following table.

                        Musical Style
City      Classical   Flamenco   Heavy Metal   Pop-Rock
City 1        4           1           6            9
City 2        3           4           5           10
City 3        2           1           8           14
City 4        5           3           2            7
City 5        2           3           6           14
City 6        9           1           2            6
(a) Create boxplots and density plots of the number of illegal CDs available for each music style. (b) Are the distribution shapes similar? (c) Are there significant differences in the numbers of CDs available according to musical style?
Solution: (a)

> TRaii

	Pearson's Chi-squared test

data:  TCSmales
X-squared = 33.033, df = 2, p-value = 6.714e-08

> TRaiii

	Pearson's Chi-squared test

data:  TCSfemales
X-squared = 115.7, df = 2, p-value < 2.2e-16

Evidence suggests there are associations between class and survival for all of the passengers grouped together (℘-value ≈ 0), men separately (℘-value = 6.714e-08), and women evaluated separately (℘-value < 2.2e-16).
27. Mental inpatients in the Virgen del Camino Hospital (Pamplona, Spain) are interviewed by expert psychiatrists to diagnose their illnesses. An important aspect in diagnosis is determining the severity of any delusions a patient might suffer. A new questioning technique has been developed to detect the presence of delusions. The technique assigns a score from 0 to 5, where 5 indicates the presence of strong delusions and 0 indicates no delusions. The psychiatrists wish to know if the new technique actually results in high scores for patients who have previously been diagnosed as suffering from severe delusions. The scores that follow were obtained from randomly selected patients who were known to suffer from delusions and those who were known not to suffer with delusions:

Delusions    Score
Present      5   5   4   5   4   5   5
Absent       1   0   5   0   4   4   0
Do the data provide evidence that the new test yields higher scores for those patients who are known to suffer from delusions than for those who do not suffer from delusions?

Solution: Due to the small sample sizes and the discrete nature of the scores, a permutation test is used to see if the new test yields higher scores for patients who are known to suffer from delusions than patients that do not suffer from delusions.

> sd(boot100$t)
[1] 0.09386073
> PEboot100
[1] 6.139268
> PEboot1000
[1] 0.9157565

The percent difference from what the standard error should be decreases as the sample size increases.

30. The “Wisconsin Card Sorting Test” is widely used by psychiatrists, neurologists, and neuropsychologists with patients who have a brain injury, neurodegenerative disease, or a mental illness such as schizophrenia. Patients with any sort of frontal lobe lesion generally do poorly on the test. The data frame WCST and the following table contain the test scores from a group of 50 patients from the Virgen del Camino Hospital (Pamplona, Spain).

23 12 31  8  7 28 25 17 19 42
17  6 19 11 36 94  6 10 22  8
20 47  5 13 28 19  8  6 11 10
19 65 13  7 18 26 35 78 11  7
19 38  8 15 40 17  5 26 15  4
(a) Use the function eda() from the PASWR2 package to explore the data and decide if normality can be assumed. (b) What assumption(s) must be made to compute a 95% confidence interval for the population mean? (c) Compute the confidence interval from (b). (d) Compute a 95% BCa bootstrap confidence interval for the mean test score. (e) Should you use the confidence interval reported in (c) or the confidence interval reported in (d)? Solution: (a) Assuming the variable score has a normal distribution is not reasonable.
> with(data = WCST,
+      eda(score))
Size (n)  Missing  Minimum   1st Qu     Mean   Median   TrMean   3rd Qu
  50.000    0.000    4.000    8.500   21.480   17.000   19.413   26.000
     Max    Stdev      Var  SE Mean   I.Q.R.    Range Kurtosis Skewness
  94.000   18.406  338.785    2.603   17.500   90.000    4.511    2.033
SW p-val
   0.000

[Figure: EXPLORATORY DATA ANALYSIS with four panels: Histogram of score, Density of score, Boxplot of score, Q−Q Plot of score.]
(b) In order to construct a 95% confidence interval for the population mean, one assumes that the values in the variable score are taken from a normal distribution. Although this is not a reasonable assumption, the sample size might be sufficiently large to overcome the skewness in the parent population. Consequently, one might appeal to the Central Limit Theorem and claim that the sampling distribution of X̄ is approximately normal due to the sample size (50). In this problem, the skewness is quite severe, and one should not be overly confident in the final interval.

(c)

> CI
[1] 16.24904 26.71096
attr(,"conf.level")
[1] 0.95

The confidence interval is [16.249, 26.711].

(d)
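One way the intervals in (c) and (d) might be computed is sketched below. It is an illustration under stated assumptions: the 50 scores are typed in from the table in the exercise instead of being read from the WCST data frame, and the names `score` and `mean_fun`, the seed, and the number of bootstrap replicates are choices made here. `boot()` and `boot.ci()` are functions from the boot package.

```r
library(boot)

# The 50 test scores listed in the exercise
score <- c(23, 12, 31,  8,  7, 28, 25, 17, 19, 42,
           17,  6, 19, 11, 36, 94,  6, 10, 22,  8,
           20, 47,  5, 13, 28, 19,  8,  6, 11, 10,
           19, 65, 13,  7, 18, 26, 35, 78, 11,  7,
           19, 38,  8, 15, 40, 17,  5, 26, 15,  4)

# (c) t-based 95% confidence interval for the mean
ci_t <- t.test(score, conf.level = 0.95)$conf.int

# (d) BCa bootstrap interval for the mean
mean_fun <- function(data, idx) mean(data[idx])  # statistic for boot()
set.seed(13)                                     # arbitrary seed
boot_out <- boot(score, statistic = mean_fun, R = 9999)
ci_bca   <- boot.ci(boot_out, conf = 0.95, type = "bca")$bca[4:5]

ci_t
ci_bca
```

Because the scores are strongly right-skewed, the BCa interval is not forced to be symmetric about the sample mean, which is the main reason to prefer it over the t interval in (e).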
> resid_matrix
              [,1]          [,2]  [,3]          [,4]
[1,]  3.403931e-15 -1.000000e+00  1.00 -1.735490e-15
[2,] -1.250000e+00 -2.500000e-01  0.75  7.500000e-01
[3,]  1.000000e+00  1.241337e-16 -1.00  1.241337e-16
> Check
     [,1] [,2] [,3] [,4]
[1,]    3    2    4    3
[2,]    2    3    4    4
[3,]    7    6    5    6

$$\begin{pmatrix} 3 & 2 & 4 & 3 \\ 2 & 3 & 4 & 4 \\ 7 & 6 & 5 & 6 \end{pmatrix} = \begin{pmatrix} 4.08 & 4.08 & 4.08 & 4.08 \\ 4.08 & 4.08 & 4.08 & 4.08 \\ 4.08 & 4.08 & 4.08 & 4.08 \end{pmatrix} + \begin{pmatrix} -1.08 & -1.08 & -1.08 & -1.08 \\ -0.83 & -0.83 & -0.83 & -0.83 \\ 1.92 & 1.92 & 1.92 & 1.92 \end{pmatrix} + \begin{pmatrix} 0 & -1 & 1 & 0 \\ -1.25 & -0.25 & 0.75 & 0.75 \\ 1 & 0 & -1 & 0 \end{pmatrix}.$$
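The decomposition above (data = grand mean + treatment effect + residual) can be verified numerically. The sketch below is illustrative: it enters the 3 × 4 yield matrix by hand, with one row per barley variety, and the object names `grand`, `effect`, and `recon` are introduced here; only `resid_matrix` echoes a name that appears in the printed output.

```r
# Rows are barley varieties A, B, C; columns are the four replicates
yield <- matrix(c(3, 2, 4, 3,
                  2, 3, 4, 4,
                  7, 6, 5, 6),
                nrow = 3, byrow = TRUE,
                dimnames = list(barley = c("A", "B", "C"), NULL))

grand        <- mean(yield)               # grand mean, about 4.08
effect       <- rowMeans(yield) - grand   # treatment effects
resid_matrix <- yield - rowMeans(yield)   # residuals (row means recycle by row)

# Rebuild the data from the three pieces
recon <- grand + effect + resid_matrix

# Error variance estimate: SSE / (N - number of treatments)
mse <- sum(resid_matrix^2) / (length(yield) - nrow(yield))

round(effect, 2)
mse
```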
The estimate of the error variance is the MSE = 0.75.

(d)

> checking.plots(model.aov)

Chapter 11: Experimental Design

[Figure: checking.plots(model.aov) output with four panels: standardized residuals versus ordered values for model.aov, normal Q−Q plot of standardized residuals from model.aov, standardized residuals versus fitted values for model.aov, and density plot of standardized residuals for model.aov (N = 12, bandwidth = 0.5719).]
Yes, the assumptions are satisfied for the model in part (a). Specifically, no discernible pattern is seen in the top left graph that would threaten the assumption of independence. The top and bottom right graphs suggest the assumption of normality for the errors is a reasonable assumption. The bottom left graph makes the assumption of constant variance appear reasonable.

(e)

> barley.mc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ barley, data = DF)

$barley
    diff       lwr      upr     p adj
B-A 0.25 -1.459747 1.959747 0.9130865
C-A 3.00  1.290253 4.709747 0.0021919
C-B 2.75  1.040253 4.459747 0.0038641

> plot(barley.mc)
[Figure: 95% family-wise confidence intervals for the differences in mean levels of barley (B−A, C−A, C−B).]
C (SULTANE) is significantly higher than both A (ASPEN) and B (ERIKA) but A (ASPEN) is not significantly different from B (ERIKA).

(f)

> CO
  C1 C2
A -1 -1
B  1 -1
C  0  2
> TR
                 Df Sum Sq Mean Sq F value   Pr(>F)
C(barley, CO, 1)  1  0.125   0.125   0.167 0.692633
C(barley, CO, 2)  1 22.042  22.042  29.389 0.000421 ***
Residuals         9  6.750   0.750
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ℘-value (4e-04), evidence suggests there is a difference between SULTANE and the other two varieties ASPEN and ERIKA.

7. As described in Basic Statistics and Data Analysis, Car and Driver (July 1995) conducted tests of five cars from five different countries: Japan’s Acura NSXT, Italy’s Ferrari F355, Great Britain’s Lotus Esprit S4S, Germany’s Porsche 911 Turbo, and the United States’ Dodge Viper RT/10. The maximum speeds the cars obtained in miles per hour using as much distance as necessary without exceeding the engine’s redline are given:
Acura    Ferrari    Lotus    Porsche    Viper
159.7     179.6     167.4     173.5     172.3
161.5     173.9     163.0     182.4     168.9
163.7     180.2     160.3     171.3     169.5
166.0     183.9     164.9     175.7     174.6
157.7     176.7     160.5     179.1     161.1
161.7     178.4     158.3     175.0     164.2
Data from Kitchens (2003, page 512). (a) What statistical model should be used to analyze this experiment? (b) Conduct an analysis of variance to investigate if differences exist among the maximum speeds of the cars. (c) Use appropriate diagnostic measures to check the adequacy of the model from part (a). (d) What is the mean squared error value for the model from part (a)? (e) Use Tukey’s multiple comparison test to determine which of the cars are different according to speed. Plot the confidence intervals for the mean differences.
Solution: (a) A complete randomized design such as Yij = µ + τi + εij
i = 1, 2, 3, 4, 5,
j = 1, . . . , 6,
εij ∼ N (0, σ)
should be used to analyze the experiment. Before proceeding with formal inferential procedures, the data are examined with the function oneway.plots(). > speed car oneway.plots(Y = speed, fac1 = car)
[Figure: oneway.plots(Y = speed, fac1 = car) output showing speed (roughly 160 to 180 mph) for Acura, Ferrari, Lotus, Porsche, and Viper, including a plot of the mean of Y by the main factor (fac1).]
Based on the output from oneway.plots(), one can see the fastest speeds have been recorded by Ferrari and Porsche, while the slowest speeds have been recorded by Acura and Lotus. (b) > > > > >
DF car.mc car.mc
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = speed ~ car, data = DF)

$car
                       diff         lwr       upr     p adj
Ferrari-Acura    17.0666667  10.6142478 23.519086 0.0000004
Lotus-Acura       0.6833333  -5.7690855  7.135752 0.9978264
Porsche-Acura    14.4500000   7.9975811 20.902419 0.0000064
Viper-Acura       6.7166667   0.2642478 13.169086 0.0384079
Lotus-Ferrari   -16.3833333 -22.8357522 -9.930914 0.0000008
Porsche-Ferrari  -2.6166667  -9.0690855  3.835752 0.7563408
Viper-Ferrari   -10.3500000 -16.8024189 -3.897581 0.0006917
Porsche-Lotus    13.7666667   7.3142478 20.219086 0.0000137
Viper-Lotus       6.0333333  -0.4190855 12.485752 0.0749333
Viper-Porsche    -7.7333333 -14.1857522 -1.280914 0.0132379
> > > >
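The Tukey comparisons printed above can be reproduced from the speeds given in the exercise. The following is a minimal sketch under stated assumptions: the object name car.aov is hypothetical (the original assignment does not survive in this text), while DF and car.mc follow the names visible in the printed output.

```r
# Speeds entered car by car, six runs each, as listed in the exercise
speed <- c(159.7, 161.5, 163.7, 166.0, 157.7, 161.7,   # Acura
           179.6, 173.9, 180.2, 183.9, 176.7, 178.4,   # Ferrari
           167.4, 163.0, 160.3, 164.9, 160.5, 158.3,   # Lotus
           173.5, 182.4, 171.3, 175.7, 179.1, 175.0,   # Porsche
           172.3, 168.9, 169.5, 174.6, 161.1, 164.2)   # Viper
car <- factor(rep(c("Acura", "Ferrari", "Lotus", "Porsche", "Viper"),
                  each = 6))
DF <- data.frame(speed, car)

# One-way ANOVA for the completely randomized design (car.aov is a
# hypothetical name)
car.aov <- aov(speed ~ car, data = DF)
summary(car.aov)

# Mean squared error: residual sum of squares over residual df
MSE <- deviance(car.aov) / df.residual(car.aov)

# Tukey's HSD pairwise comparisons and a plot of the intervals
car.mc <- TukeyHSD(car.aov)
car.mc
plot(car.mc, las = 1)
```

Intervals in the plot that cross zero correspond to pairs of cars whose mean speeds are not significantly different at the 95% family-wise level.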
opar DF TR TR
	Paired t-test

data:  yield by site
t = -11.159, df = 9, p-value = 1.426e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -28.92944 -19.17723
sample estimates:
mean of the differences
              -24.05333

Based on the small ℘-value (1.426e-06), reject H0 and conclude that the mean of the differences is not equal to zero. (b) In this setting, variety is a block.
> TR TR
            Df Sum Sq Mean Sq F value   Pr(>F)
site         1 2892.8  2892.8 124.523 1.43e-06 ***
variety      9  313.5    34.8   1.499    0.278
Residuals    9  209.1    23.2
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the small ℘-value (1.43e-06), reject H0 and conclude that the mean yields at the two sites differ. (c) When an RCBD with two treatments has as many blocks as there are pairs in a two-sided paired t-test, the two procedures give the same ℘-value. Here the F statistic for site (124.523) is the square of the paired t statistic (−11.159).
9. An insurance company wants to know how its resources are being used with respect to time spent issuing travel insurance policies. The company randomly selects three moments during a day and records the time required to issue a travel insurance policy to three randomly selected clients who take out a travel policy over the phone, over the Internet, and in person. The data obtained (in minutes) are

telephone   3.49  2.38  2.09
Internet    4.38  6.68  5.37
in person   7.91  8.70  8.54
(a) What type of design structure did the company use?
(b) Propose a statistical model to analyze the data.
(c) Comment on any assumptions that need to be made with the model selected in part (b). Check these assumptions.
(d) Test to see if differences exist among the methods used to issue insurance policies.
(e) Estimate the model’s parameters.
(f) How is the standard deviation of the errors estimated?
(g) Write the estimated model in matrix form.
(h) Do the residuals sum to zero?
(i) Use Tukey’s HSD to determine if significant differences exist among methods.
(j) Create a barplot of the mean times, and display the standard errors over their respective means.
Solution: (a) The design structure is a completely randomized design (CRD). (b) $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2, 3$, $j = 1, 2, 3$, $\varepsilon_{ij} \sim N(0, \sigma)$
(c) The three basic assumptions concerning the errors: independence, normal distribution, and constant variance are assessed with the checking.plots() function. > > + > > > >
time MSE MSE
[1] 0.6838333
> sde sde
[1] 0.8269422
An estimate of the standard deviation of the errors is the square root of the mean squared error (0.8269). (g)
> EFF EFF
  (Intercept)   treatment  Residuals
1    5.504444 -2.85111111  0.8366667
2    5.504444 -2.85111111 -0.2733333
3    5.504444 -2.85111111 -0.5633333
4    5.504444 -0.02777778 -1.0966667
5    5.504444 -0.02777778  1.2033333
6    5.504444 -0.02777778 -0.1066667
7    5.504444  2.87888889 -0.4733333
8    5.504444  2.87888889  0.3166667
9    5.504444  2.87888889  0.1566667
> MeanMat MeanMat
         [,1]     [,2]     [,3]
[1,] 5.504444 5.504444 5.504444
[2,] 5.504444 5.504444 5.504444
[3,] 5.504444 5.504444 5.504444
> TreatMat TreatMat
            [,1]        [,2]        [,3]
[1,] -2.85111111 -2.85111111 -2.85111111
[2,] -0.02777778 -0.02777778 -0.02777778
[3,]  2.87888889  2.87888889  2.87888889
> ResidMat ResidMat
           [,1]       [,2]       [,3]
[1,]  0.8366667 -0.2733333 -0.5633333
[2,] -1.0966667  1.2033333 -0.1066667
[3,] -0.4733333  0.3166667  0.1566667
> Values Values
     [,1] [,2] [,3]
[1,] 3.49 2.38 2.09
[2,] 4.38 6.68 5.37
[3,] 7.91 8.70 8.54
(h) The residuals sum to zero.
> sum(resid(insurance.aov))
[1] -1.387779e-16
> # Or
> sum(ResidMat)
[1] -1.387779e-16
(i)
> TukeyHSD(insurance.aov)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = time ~ treatment, data = DF)

$treatment
                         diff       lwr        upr     p adj
internet-in person  -2.906667 -4.978352 -0.8349817 0.0119941
telephone-in person -5.730000 -7.801685 -3.6583151 0.0003595
telephone-internet  -2.823333 -4.895018 -0.7516484 0.0137052
The three methods of issuing insurance are all significantly different. (j)
> library(plyr)
> mdf mdf
  treatment MeanTreat        SE
1 in person  8.383333 0.2411316
2  internet  5.476667 0.6660914
3 telephone  2.653333 0.4266276
> ggplot(data = mdf, aes(x = treatment, y = MeanTreat, fill = treatment)) +
+   geom_bar(stat = "identity") +
+   geom_errorbar(aes(ymin = MeanTreat - SE, ymax = MeanTreat + SE),
+                 width = 0.25) +
+   guides(fill = FALSE) +
+   labs(x = "", y = "Mean Time to Issue Policy (in minutes)",
+        title = "Mean Time to Issue Policy \n with Individual Standard Errors") +
+   theme_bw()
[Figure: bar plot titled "Mean Time to Issue Policy with Individual Standard Errors"; y-axis: Mean Time to Issue Policy (in minutes); bars for in person, internet, and telephone with standard error bars.]
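Much of the code for this solution does not survive in the text above, so the following sketch shows one way the printed quantities (grand mean, treatment effects, MSE, and the estimated error standard deviation) could be obtained. The names DF, time, treatment, and insurance.aov follow the printed output; tau.hat is a hypothetical name introduced here.

```r
# Times to issue a policy (minutes), three clients per method
time <- c(3.49, 2.38, 2.09,    # telephone
          4.38, 6.68, 5.37,    # internet
          7.91, 8.70, 8.54)    # in person
treatment <- factor(rep(c("telephone", "internet", "in person"), each = 3))
DF <- data.frame(time, treatment)

insurance.aov <- aov(time ~ treatment, data = DF)
summary(insurance.aov)

# Parameter estimates: grand mean and treatment effects
mu.hat  <- mean(DF$time)
tau.hat <- tapply(DF$time, DF$treatment, mean) - mu.hat  # hypothetical name

# Error variance and standard deviation estimates
MSE <- sum(resid(insurance.aov)^2) / df.residual(insurance.aov)
sde <- sqrt(MSE)

# The residuals sum to zero up to floating-point error
sum(resid(insurance.aov))
```

The treatment effects sum to zero by construction because they are deviations of the group means from the grand mean in a balanced design.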
10. A health-conscious pizza parlor is attempting to specify the added calories for each ingredient of its medium size pizza. Specifically, the pizza parlor wants to know if there is more variability in an olive topping due to olive suppliers or due to the olives themselves. From numerous suppliers, four are selected randomly and the calories for a pizza topping of olives are recorded for five randomly selected pizzas. The data obtained are given in the following table:

Supplier 1   133  136  142  135  134
Supplier 2   124  137  125  132  131
Supplier 3   127  126  130  120  123
Supplier 4   150  141  155  150  157
(a) Specify a statistical model to analyze these data. (b) Conduct an ANOVA.
(c) Estimate the variance components and the total variability of the data. (d) Interpret the results. Solution: (a) The model is $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $\tau_i \sim N(0, \sigma_\alpha)$, $\varepsilon_{ij} \sim N(0, \sigma)$
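Since the code for parts (b) and (c) is not recoverable from this text, here is a minimal sketch of the usual method-of-moments estimates for a balanced one-way random effects model. It assumes each row of the exercise table holds the five pizzas for one supplier; the object names (calories, supplier, fit, sig2.supplier, sig2.error, sig2.total) are hypothetical.

```r
# Calories for five pizzas from each of four randomly chosen suppliers
calories <- c(133, 136, 142, 135, 134,   # Supplier 1
              124, 137, 125, 132, 131,   # Supplier 2
              127, 126, 130, 120, 123,   # Supplier 3
              150, 141, 155, 150, 157)   # Supplier 4
supplier <- factor(rep(paste("Supplier", 1:4), each = 5))
n <- 5                                   # pizzas per supplier

fit <- aov(calories ~ supplier)
ms  <- summary(fit)[[1]][["Mean Sq"]]    # mean squares from the ANOVA table
MST <- ms[1]                             # between-supplier mean square
MSE <- ms[2]                             # within-supplier mean square

# Method-of-moments variance component estimates (balanced design)
sig2.error    <- MSE
sig2.supplier <- (MST - MSE) / n
sig2.total    <- sig2.supplier + sig2.error
c(supplier = sig2.supplier, error = sig2.error, total = sig2.total)
```

Comparing sig2.supplier with sig2.error indicates whether more of the variability is attributable to suppliers or to pizza-to-pizza differences within a supplier.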
(b) > + + + > + > > > > >
calories >
MST > + + + + +
pulpbright > > >
MST > > > > > > >
$L = \frac{1}{n}\left(\frac{MST}{f_{1-\alpha/2;\,a-1,\,N-a}\; MSE} - 1\right)$ and $U = \frac{1}{n}\left(\frac{MST}{f_{\alpha/2;\,a-1,\,N-a}\; MSE} - 1\right)$
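The limits L and U can be computed directly with qf(). The sketch below is an illustration only: it assumes that the subscript on f denotes lower-tail area (so that f with subscript 1 − α/2 is the larger quantile), that the limits bound the ratio of the treatment variance component to the error variance, and it uses made-up mean squares. The function name var_ratio_ci is hypothetical.

```r
# Confidence limits L and U from mean squares in a balanced
# one-way random effects model with a groups, n per group, N = a * n
var_ratio_ci <- function(MST, MSE, n, a, N, alpha = 0.05) {
  f.hi <- qf(1 - alpha / 2, a - 1, N - a)  # upper quantile, used in L
  f.lo <- qf(alpha / 2, a - 1, N - a)      # lower quantile, used in U
  L <- (MST / (f.hi * MSE) - 1) / n
  U <- (MST / (f.lo * MSE) - 1) / n
  c(L = L, U = U)
}

# Illustrative (made-up) mean squares, not taken from the exercise
ci <- var_ratio_ci(MST = 600, MSE = 25, n = 5, a = 4, N = 20)
ci
```

A negative lower limit can occur when MST is small relative to MSE; in practice it is often truncated at zero because a variance ratio cannot be negative.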
dfn
> model.tables(appliance.aov, type = "effects")
Tables of effects

 cycle
cycle
Prewash   Short  Medium    Long
 -5.888  -3.037   2.993   5.933

 machine
machine
Machine 1 Machine 2 Machine 3 Machine 4 Machine 5
   -0.435   -11.160     5.490     9.615    -3.510
> EFF EFF
   (Intercept)   cycle machine Residuals
1      21.3975 -5.8875  -0.435     0.375
2      21.3975 -3.0375  -0.435     2.025
3      21.3975  2.9925  -0.435    -0.855
4      21.3975  5.9325  -0.435    -1.545
5      21.3975 -5.8875 -11.160    -1.200
6      21.3975 -3.0375 -11.160    -0.900
7      21.3975  2.9925 -11.160     0.570
8      21.3975  5.9325 -11.160     1.530
9      21.3975 -5.8875   5.490    -0.900
10     21.3975 -3.0375   5.490    -1.800
11     21.3975  2.9925   5.490     2.220
12     21.3975  5.9325   5.490     0.480
13     21.3975 -5.8875   9.615     0.075
14     21.3975 -3.0375   9.615    -0.825
15     21.3975  2.9925   9.615    -0.855
16     21.3975  5.9325   9.615     1.605
17     21.3975 -5.8875  -3.510     1.650
18     21.3975 -3.0375  -3.510     1.500
19     21.3975  2.9925  -3.510    -1.080
20     21.3975  5.9325  -3.510    -2.070
> GMmat GMmat
        [,1]    [,2]    [,3]    [,4]
[1,] 21.3975 21.3975 21.3975 21.3975
[2,] 21.3975 21.3975 21.3975 21.3975
[3,] 21.3975 21.3975 21.3975 21.3975
[4,] 21.3975 21.3975 21.3975 21.3975
[5,] 21.3975 21.3975 21.3975 21.3975
> CYCLEmat CYCLEmat
        [,1]    [,2]   [,3]   [,4]
[1,] -5.8875 -3.0375 2.9925 5.9325
[2,] -5.8875 -3.0375 2.9925 5.9325
[3,] -5.8875 -3.0375 2.9925 5.9325
[4,] -5.8875 -3.0375 2.9925 5.9325
[5,] -5.8875 -3.0375 2.9925 5.9325
> MACHINEmat MACHINEmat
        [,1]    [,2]    [,3]    [,4]
[1,]  -0.435  -0.435  -0.435  -0.435
[2,] -11.160 -11.160 -11.160 -11.160
[3,]   5.490   5.490   5.490   5.490
[4,]   9.615   9.615   9.615   9.615
[5,]  -3.510  -3.510  -3.510  -3.510
> RESIDUALmat RESIDUALmat
       [,1]   [,2]   [,3]   [,4]
[1,]  0.375  2.025 -0.855 -1.545
[2,] -1.200 -0.900  0.570  1.530
[3,] -0.900 -1.800  2.220  0.480
[4,]  0.075 -0.825 -0.855  1.605
[5,]  1.650  1.500 -1.080 -2.070
> VALUES VALUES
      [,1]  [,2]  [,3]  [,4]
[1,] 15.45 19.95 23.10 25.35
[2,]  3.15  6.30 13.80 17.70
[3,] 20.10 22.05 32.10 33.30
[4,] 25.20 27.15 33.15 38.55
[5,] 13.65 16.35 19.80 21.75
> matrix(apply(EFF, 1, sum), byrow = TRUE, nrow = 5)
      [,1]  [,2]  [,3]  [,4]
[1,] 15.45 19.95 23.10 25.35
[2,]  3.15  6.30 13.80 17.70
[3,] 20.10 22.05 32.10 33.30
[4,] 25.20 27.15 33.15 38.55
[5,] 13.65 16.35 19.80 21.75
> xtabs(time ~ machine + cycle, data = DF)
           cycle
machine     Prewash Short Medium  Long
  Machine 1   15.45 19.95  23.10 25.35
  Machine 2    3.15  6.30  13.80 17.70
  Machine 3   20.10 22.05  32.10 33.30
  Machine 4   25.20 27.15  33.15 38.55
  Machine 5   13.65 16.35  19.80 21.75
> apply(EFF[,2:4], 2, sum)  # Check that constraints sum to zero
        cycle       machine     Residuals
-4.884981e-15  4.440892e-15  2.220446e-16
The constraints for this model are satisfied since the estimates of τi, βj, and εij each sum to zero (up to floating-point error). Note that the object EFF has estimates of µ, τi, βj, and εij under the headings (Intercept), cycle, machine, and Residuals, respectively. (i)
> TukeyHSD(appliance.aov, which = "machine")
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = time ~ cycle + machine, data = DF)

$machine
                       diff         lwr        upr     p adj
Machine 2-Machine 1 -10.725 -14.6234199 -6.8265801 0.0000119
Machine 3-Machine 1   5.925   2.0265801  9.8234199 0.0030033
Machine 4-Machine 1  10.050   6.1515801 13.9484199 0.0000232
Machine 5-Machine 1  -3.075  -6.9734199  0.8234199 0.1514411
Machine 3-Machine 2  16.650  12.7515801 20.5484199 0.0000001
Machine 4-Machine 2  20.775  16.8765801 24.6734199 0.0000000
Machine 5-Machine 2   7.650   3.7515801 11.5484199 0.0003323
Machine 4-Machine 3   4.125   0.2265801  8.0234199 0.0364514
Machine 5-Machine 3  -9.000 -12.8984199 -5.1015801 0.0000704
Machine 5-Machine 4 -13.125 -17.0234199 -9.2265801 0.0000014

All machines are significantly different from one another at the 95% family-wise confidence level, with the exception of Machine 5 and Machine 1 (p adj = 0.1514). The difference between Machine 4 and Machine 3 is the weakest of the significant differences (p adj = 0.0365). (j) Recall that there are (a − 1) orthogonal contrasts for a treatments. In this case, since there are 5 washing machines, there are 4 orthogonal contrasts. > > > > > >
contrasts(DF$machine)[ , 1] contrasts(DF$machine)[ , 2] contrasts(DF$machine)[ , 3] contrasts(DF$machine)[ , 4] CO TR TR
                  Df Sum Sq Mean Sq F value   Pr(>F)
C(machine, CO, 1)  1  863.2   863.2 288.527 9.30e-10 ***
C(machine, CO, 2)  1  104.6   104.6  34.957 7.11e-05 ***
C(machine, CO, 3)  1   69.8    69.8  23.345 0.000411 ***
C(machine, CO, 4)  1    0.9     0.9   0.316 0.584226
cycle              3  440.2   146.7  49.045 5.21e-07 ***
Residuals         12   35.9     3.0
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note that Contrast 3 has a ℘-value of 4e-04, which suggests that the mean washing time of machines 2, 3, and 4 is significantly different from the mean washing time of machine 5. (k)
> library(plyr) # load package
> mdf mdf
    machine MeanMachine
1 Machine 1     20.9625
2 Machine 2     10.2375
3 Machine 3     26.8875
4 Machine 4     31.0125
5 Machine 5     17.8875
> MSE MSE
[1] 2.99175
> ME ME
[1] 1.884311
> ggplot(data = mdf, aes(x = machine, y = MeanMachine, fill = machine)) +
+   geom_bar(stat = "identity") +
+   geom_errorbar(aes(ymin = MeanMachine - ME, ymax = MeanMachine + ME),
+                 width = 0.25) +
+   guides(fill = FALSE) +
+   labs(x = "", y = "Mean Wash Time (in minutes)",
+        title = "Mean Washing Time by Machine \n with Individual 95% CIs") +
+   theme_bw()
[Figure: bar plot titled "Mean Washing Time by Machine with Individual 95% CIs"; y-axis: Mean Wash Time (in minutes); bars for Machine 1 through Machine 5 with error bars.]
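The assignments that created DF, MSE, and ME are missing from the text above, so the sketch below shows one way to rebuild them from the VALUES matrix printed in part (h). The construction of DF and the formula used for ME (a t quantile times the standard error of a machine mean based on four cycles) are assumptions that happen to reproduce the printed values 2.99175 and 1.884311.

```r
# Washing times: rows are Machines 1-5, columns are the four cycles
VALUES <- matrix(c(15.45, 19.95, 23.10, 25.35,
                    3.15,  6.30, 13.80, 17.70,
                   20.10, 22.05, 32.10, 33.30,
                   25.20, 27.15, 33.15, 38.55,
                   13.65, 16.35, 19.80, 21.75),
                 nrow = 5, byrow = TRUE)

# Long format: one row per machine and cycle combination
DF <- data.frame(
  time    = as.vector(t(VALUES)),
  cycle   = factor(rep(c("Prewash", "Short", "Medium", "Long"), times = 5),
                   levels = c("Prewash", "Short", "Medium", "Long")),
  machine = factor(rep(paste("Machine", 1:5), each = 4))
)

# Additive model with cycle and machine
appliance.aov <- aov(time ~ cycle + machine, data = DF)

# Mean squared error and margin of error for an individual 95% CI
MSE <- deviance(appliance.aov) / df.residual(appliance.aov)
ME  <- qt(0.975, df.residual(appliance.aov)) * sqrt(MSE / 4)

# Decomposition of each observation into mean, effects, and residual
EFF <- proj(appliance.aov)
head(EFF)
```

Each row of EFF sums to the corresponding observed time, which is the check performed with apply(EFF, 1, sum) in part (h).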
13. The Environmental Protection Agency (EPA) is interested in the fuel consumption of older vehicles. An experiment is designed where the gallons of gasoline consumed by vehicles over six years old are measured when the same driver travels 162.78 miles from Boone, North Carolina, to Durham, North Carolina, in 35 different vehicles. Seven vehicles are randomly selected from each category to be tested. The categories are compact, station wagon, minivan, van, and full-size pickup truck. The data obtained (gallons consumed) are given in the following table:

Compact          4.35   4.96   4.82   4.62   4.32   4.70   4.82
Station Wagon    5.47   6.35   5.33   6.25   5.44   5.73   5.64
Minivan          9.37   7.43   8.40   6.76   8.62   7.53   7.54
Van              8.61   8.66  10.12   8.06   9.31   6.75   8.14
Pickup Truck    20.09  14.93  13.38  16.53  13.79  12.44  14.73
(a) Based on the described randomization, what type of design structure did the EPA use? (b) Propose a statistical model to analyze these data. (c) Are the assumptions for the model specified in part (b) satisfied? If the assumptions for the model specified in part (b) are not satisfied, suggest fixes before advancing to the next question.
(d) Are there significant differences between the fuel consumption for the five types of vehicles? (e) Estimate the model’s error variance. (f) What conclusions can be drawn from the data? Solution: (a) The design structure used by the EPA is a completely randomized design (CRD). (b) The model to use is $Y_{ij} = \mu + \tau_i + \varepsilon_{ij}$, $i = 1, 2, 3, 4, 5$, $j = 1, \ldots, 7$, $\varepsilon_{ij} \sim N(0, \sigma)$.
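As a quick first look before the formal diagnostics, the group means and standard deviations can be compared. The sketch below assumes the flattened data table reads as seven values per vehicle category in the order listed; the object names fuel, vehicle, and group.sd are hypothetical. Group standard deviations that grow with the group means would point toward a variance-stabilizing transformation, which is the kind of fix part (c) asks about.

```r
# Gallons consumed, seven vehicles per category (assumed table layout)
fuel <- c(4.35, 4.96, 4.82, 4.62, 4.32, 4.70, 4.82,          # Compact
          5.47, 6.35, 5.33, 6.25, 5.44, 5.73, 5.64,          # Station Wagon
          9.37, 7.43, 8.40, 6.76, 8.62, 7.53, 7.54,          # Minivan
          8.61, 8.66, 10.12, 8.06, 9.31, 6.75, 8.14,         # Van
          20.09, 14.93, 13.38, 16.53, 13.79, 12.44, 14.73)   # Pickup Truck
types <- c("Compact", "Station Wagon", "Minivan", "Van", "Pickup Truck")
vehicle <- factor(rep(types, each = 7), levels = types)

# Group means and standard deviations
group.mean <- tapply(fuel, vehicle, mean)
group.sd   <- tapply(fuel, vehicle, sd)
round(cbind(mean = group.mean, sd = group.sd), 3)

# Side-by-side boxplots to compare spread visually
boxplot(fuel ~ vehicle, ylab = "Gallons consumed")
```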
(c) The three basic assumptions concerning the errors: independence, normal distribution, and constant variance are assessed with the checking.plots() function. > + + + + > + > > > >
fuel TR TR Df Sum Sq Mean Sq F value Pr(>F) vehicle 4 2362.5 590.6 138.1 + > > + > > + +
resingrams
> ggplot(data = pines, aes(x = shape, y = resingrams,
+        colour = acidtreatment, group = acidtreatment,
+        linetype = acidtreatment)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw() +
+   labs(x = "", y = "Resin in grams")
[Figure: two interaction plots of mean resin in grams. One has acid treatment (Acid, Control) on the x-axis with a line for each shape (Check, Circle, Diagonal, Rectangle); the other has shape on the x-axis with a line for each acid treatment.]
The nearly parallel lines in both plots corroborate the assumption of no interaction between the factors acid treatment and hole shape. (d) The three basic assumptions concerning the errors (independence, normal distribution, and constant variance) are assessed with the checking.plots() function. > checking.plots(pines.aov)
[Figure: checking.plots(pines.aov) output with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 24, bandwidth = 0.4869). Observations 11, 12, and 21 are labeled.]
The assumption of constant variance is a little questionable but the other assumptions appear to be satisfied. (e)
> EFF EFF
Tables of effects

 acidtreatment
acidtreatment
   Acid Control
  7.375  -7.375

 shape
shape
    Check    Circle  Diagonal Rectangle
    14.37    -44.96     -1.13     31.71

 acidtreatment:shape
             shape
acidtreatment Check  Circle Diagonal Rectangle
      Acid     0.625 -5.042  0.792    3.625
      Control -0.625  5.042 -0.792   -3.625
Estimates for αi and βj are (7.375, -7.375) and (14.375, -44.9583, -1.125, 31.7083), respectively. (f)
> pinesNI.aov TR TR
              Df Sum Sq Mean Sq F value   Pr(>F)
acidtreatment  1   1305    1305   25.87 6.56e-05 ***
shape          3  19407    6469  128.20 8.71e-13 ***
Residuals     19    959      50
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The completely additive model suggests both factors acidtreatment and shape are significant based on the ℘-values 1e-04 and 0, respectively. (g)
> CI CI
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = resingrams ~ acidtreatment + shape, data = pines)

$shape
                        diff        lwr        upr     p adj
Circle-Check       -59.33333 -70.865641 -47.801025 0.0000000
Diagonal-Check     -15.50000 -27.032308  -3.967692 0.0063690
Rectangle-Check     17.33333   5.801025  28.865641 0.0023646
Diagonal-Circle     43.83333  32.301025  55.365641 0.0000000
Rectangle-Circle    76.66667  65.134359  88.198975 0.0000000
Rectangle-Diagonal  32.83333  21.301025  44.365641 0.0000009
The mean resin collected for rectangular shaped holes is significantly greater than the mean resin collected for check, circular, and diagonal shapes. 15. The data stored in COWS were extracted from a Canadian record book of purebred dairy cattle. Random samples of 10 mature (five-year-old and older) and 10 two-year-old cows were taken from each of five breeds. The average butterfat percentage of these 100 cows is stored in the variable butterfat, with the type of cow stored in the variable breed and the age of the cow stored in the variable age. (a) Create a two-way ANOVA table. (b) Analyze the residuals and comment on whether the two-factorial model with interaction fits the data. (c) If there are problems that might be remedied with a transformation, suggest an appropriate transformation and reanalyze the new model. (d) Create a graphical display of the interactions for the model selected in (c). Is there significant interaction between breed and age? (e) Based on the model selected in (c), compute group means and parameter estimates to fill in a table similar to Table 11.18. (f) Using αe = 0.05, which breed has the highest average butterfat percentage? Solution: (a) > cows.aov summary(cows.aov) Df Sum Sq Mean Sq F value Pr(>F) age 1 0.21 0.207 1.172 0.282 breed 4 34.32 8.580 48.595 checking.plots(cows.aov)
[Figure: checking.plots(cows.aov) output with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 100, bandwidth = 0.305). Observations 82, 96, and 99 are labeled.]
There appears to be an increasing variance with larger butterfat values. The model does not satisfy the homogeneity of variance assumption. (c) The boxCox() function from the car package is used on the object cows.aov; the package must be loaded first.
> library(car)
> boxCox(cows.aov, lambda = seq(-3, 0, length = 300))
The 95% confidence interval for λ extends from roughly -2.4 to -0.5. Since a transformation using λ = −1 is inside the 95% confidence interval as well as being a monotonic transformation, the decision to use λ = −1 is made.
> cowsTI.aov TR TR
            Df  Sum Sq  Mean Sq F value Pr(>F)
age          1 0.00035 0.000355   0.977  0.326
breed        4 0.08797 0.021993  60.599
age:breed    4 0.00076
Residuals   90 0.03266
---
Signif. codes:
> checking.plots(cowsTI.aov)
[Figure: checking.plots(cowsTI.aov) output with four panels: standardized residuals versus ordered values, normal Q-Q plot of standardized residuals, standardized residuals versus fitted values, and density plot of standardized residuals (N = 100, bandwidth = 0.3601). Observations 48, 65, and 96 are labeled.]
After the butterfat values have been transformed, increasing variance no longer appears problematic. Although there are five observations whose standardized residuals are greater in absolute value than two, this is to be expected with one hundred observations. (d)
> ggplot(data = COWS, aes(x = age, y = butterfat^-1, colour = breed,
+        group = breed, linetype = breed)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw() +
+   labs(x = "", y = expression(butterfat^{-1}))
> ggplot(data = COWS, aes(x = breed, y = butterfat^-1, colour = age,
+        group = age, linetype = age)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw() +
+   labs(x = "", y = expression(butterfat^{-1}))
[Figure: two interaction plots of mean butterfat^-1 (roughly 0.20 to 0.28). One has age (2 years old, Mature) on the x-axis with a line for each breed (Ayrshire, Canadian, Guernsey, Holstein-Friesian, Jersey); the other has breed on the x-axis with a line for each age.]
The interaction plots show relatively parallel lines, suggesting age and breed do not interact, which is corroborated by the interaction ℘-value = 0.7169 computed in (c). (e)
> model.tables(cowsTI.aov, type = "means")
Tables of means
Grand mean
0.2285625

 age
age
2 years old      Mature
    0.22668     0.23045

 breed
breed
         Ayrshire          Canadian          Guernsey Holstein-Friesian            Jersey
          0.24730           0.22671           0.20388           0.27372           0.19121

 age:breed
             breed
age           Ayrshire Canadian Guernsey Holstein-Friesian Jersey
  2 years old 0.24426  0.22282  0.19941  0.27691           0.18999
  Mature      0.25034  0.23059  0.20835  0.27054           0.19242
> model.tables(cowsTI.aov, type = "effects")
Tables of effects

 age
age
2 years old      Mature
 -0.0018833   0.0018833

 breed
breed
         Ayrshire          Canadian          Guernsey Holstein-Friesian            Jersey
          0.01874          -0.00186          -0.02468           0.04516          -0.03736

 age:breed
             breed
age           Ayrshire  Canadian  Guernsey  Holstein-Friesian Jersey
  2 years old -0.001156 -0.002000 -0.002586  0.005069          0.000673
  Mature       0.001156  0.002000  0.002586 -0.005069         -0.000673
Table 11.1: Group means and parameter estimates for the object cowsTI.aov

                                             Breed
Age                     Ayrshire   Canadian   Guernsey   H-F        Jersey     Ȳi••       α̂i = Ȳi•• − Ȳ•••
2 year old   Ȳ1j•       0.2443     0.2228     0.1994     0.2769     0.1900     0.2267     −0.001883
             (αβ)^1j    −0.0012    −0.0020    −0.0026    0.0051     0.0007
mature       Ȳ2j•       0.2503     0.2306     0.2084     0.2705     0.1924     0.2304     0.001883
             (αβ)^2j    0.0012     0.0020     0.0026     −0.0051    −0.0007
Ȳ•j•                    0.2473     0.2267     0.2039     0.2737     0.1912     Ȳ••• = 0.2286
β̂j = Ȳ•j• − Ȳ•••        0.01874    −0.00186   −0.02468   0.04516    −0.03736
Using the results from the function model.tables(), Table 11.1 is created. (f) To determine which breeds have higher butterfat production, Tukey’s HSD pairwise confidence intervals are created using the function TukeyHSD().
> CIt CIt
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = I(butterfat^-1) ~ age * breed, data = COWS)

$breed
                                  diff          lwr          upr     p adj
Canadian-Ayrshire          -0.02059248 -0.037363459 -0.003821510 0.0082292
Guernsey-Ayrshire          -0.04341678 -0.060187757 -0.026645808 0.0000000
Holstein-Friesian-Ayrshire  0.02642525  0.009654271  0.043196220 0.0002965
Jersey-Ayrshire            -0.05609240 -0.072863376 -0.039321427 0.0000000
Guernsey-Canadian          -0.02282430 -0.039595272 -0.006053323 0.0024803
Holstein-Friesian-Canadian  0.04701773  0.030246756  0.063788705 0.0000000
Jersey-Canadian            -0.03549992 -0.052270892 -0.018728942 0.0000006
Holstein-Friesian-Guernsey  0.06984203  0.053071053  0.086613003 0.0000000
Jersey-Guernsey            -0.01267562 -0.029446594  0.004095355 0.2274154
Jersey-Holstein-Friesian   -0.08251765 -0.099288622 -0.065746673 0.0000000
At the αe = 0.05 level, all breeds are significantly different from one another with the exception of Jersey and Guernsey. Because the analysis uses butterfat^-1, the smallest transformed means correspond to the highest butterfat percentages. Jersey (0.19121) and Guernsey (0.20388) have the smallest transformed means and are not significantly different from each other, so these two breeds share the highest average butterfat percentage. 16. Photosynthesis in aquatic plants is often inhibited due to the salinity of the water. Some plants such as Cymodocea nodosa seagrass appear to thrive in waters with high salinity. To determine the stress of Cymodocea nodosa seagrass seedlings in four levels of salinity (05PSU, 11PSU, 18PSU, and 36PSU), with two levels of spermidine (NO, YES), plant stress was determined by taking the ratio of Fv /Fm of four vessels each with two Cymodocea nodosa seagrass seedlings that were randomly assigned to the eight treatments. Fv is the variable fluorescence, and Fm is the maximal fluorescence. The ratio Fv /Fm is stored under the variable name fluorescence in the SEAGRASS.csv file. The treatment structure is a 2 × 4 factorial experiment with 32 experimental units, where an experimental
unit is a vessel containing two Cymodocea nodosa seagrass seedlings. The data stored at https://raw.github.com/alanarnholt/Data/master/SEAGRASS.csv is part of a larger study by Elso et al. (2012). Plants not under stress typically have Fv /Fm values between 0.7 and 0.8. Salinity for this study was recorded in practical salinity units (PSU), where 36PSU corresponds to typical ocean salinity. (a) Download the SEAGRASS.csv file using the source_data() function from the repmis package and store the results in an object named SEAGRASS. (b) Find and report the mean and standard deviation of fluorescence for the 8 treatment combinations. Does it appear that plants are more stressed without spermidine and at lower levels of salinity? (c) Are the assumptions for a factorial model satisfied with this data? (d) Create interaction plots for the factors spermidine and salinity. Based on your graphs, is there interaction between spermidine and salinity? (e) Write the hypotheses to test the main effects and the interaction for a 2 factor factorial design. (f) Test the hypotheses from (e). (g) Create and plot 99% family-wise confidence intervals for the pair-wise differences of the factors spermidine and salinity using the function TukeyHSD(). Interpret your confidence intervals. (h) Compute the means and the effects for the variables spermidine and salinity in the factorial model using the function model.tables(). (i) Assume the true means for the eight treatments are: > > > + >
MEANS
> with(data = SEAGRASS,
+      tapply(fluorescence, list(spermidine, salinity), mean))
        05PSU     11PSU     18PSU     36PSU
NO  0.5827917 0.7074583 0.7736958 0.7818125
YES 0.6700104 0.7239271 0.8026875 0.7990209
> with(data = SEAGRASS,
+      tapply(fluorescence, list(spermidine, salinity), sd))
         05PSU      11PSU      18PSU      36PSU
NO  0.05174214 0.06391753 0.01106553 0.00976514
YES 0.02473667 0.04936795 0.02479287 0.01492331
(c) Neither the qqPlot() nor the Shapiro-Wilk normality test suggests any deviation from normality with respect to the error terms of the factorial model stored in SEAGRASS.mod. Levene’s test of homogeneity of variance finds insufficient evidence to suggest variances are not equal. The assumptions for a factorial model appear to be satisfied. > > > >
SEAGRASS.mod
> leveneTest(SEAGRASS.mod)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  7  0.8973 0.5243
      24
[Figure: qqPlot() of Studentized Residuals(SEAGRASS.mod) against norm Quantiles.]
(d)
> ggplot(data = SEAGRASS, aes(x = salinity, y = fluorescence,
+        color = spermidine, group = spermidine,
+        linetype = spermidine)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw()
> ggplot(data = SEAGRASS, aes(x = spermidine, y = fluorescence,
+        color = salinity, group = salinity,
+        linetype = salinity)) +
+   stat_summary(fun.y = mean, geom = "point") +
+   stat_summary(fun.y = mean, geom = "line") +
+   theme_bw()
[Figure: two interaction plots of mean fluorescence (roughly 0.60 to 0.80). The first has salinity (05PSU, 11PSU, 18PSU, 36PSU) on the x-axis with a line for each level of spermidine (NO, YES); the second has spermidine on the x-axis with a line for each level of salinity.]
The lines in the second graph cross, suggesting a small amount of interaction may be present. (e) The hypotheses to test the main effects and the interaction for a 2 factor factorial design are:
• $H_0: \alpha_i = 0$ for all $i$ versus $H_1: \alpha_i \neq 0$ for some $i$.
• $H_0: \beta_j = 0$ for all $j$ versus $H_1: \beta_j \neq 0$ for some $j$.
• $H_0: (\alpha\beta)_{ij} = 0$ for all $(i, j)$ versus $H_1: (\alpha\beta)_{ij} \neq 0$ for some $(i, j)$.
(f)
> SEAGRASS.mod SR SR
                    Df  Sum Sq Mean Sq F value   Pr(>F)
spermidine           1 0.01123 0.01123   8.270  0.00832 **
salinity             3 0.14379 0.04793  35.285 5.83e-09 ***
spermidine:salinity  3 0.00680 0.00227   1.668  0.20044
Residuals           24 0.03260 0.00136
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ℘-value of 0.0083, there is evidence to suggest some effect levels of spermidine are not zero. Based on the ℘-value of 5.83e-09, there is evidence to suggest some effect levels of salinity are not zero. Based on the ℘-value of 0.2004, there is not sufficient evidence to suggest any interaction effects are not zero. (g) We can be 99% confident that adding spermidine reduces plant stress (that is, it increases the value of fluorescence). We can be 99% confident that all pair-wise levels of salinity except 36PSU and 18PSU are significantly different in the amount of stress they produce on Cymodocea nodosa seagrass seedlings.
> plot(TukeyHSD(SEAGRASS.mod, which = c("spermidine", "salinity"),
+      conf.level = 0.99))
[Figure: plot(TukeyHSD()) output at the 99% family-wise confidence level, with one panel for differences in mean levels of spermidine (YES−NO) and one panel for differences in mean levels of salinity (pairwise comparisons among 05PSU, 11PSU, 18PSU, and 36PSU).]
(h) The requested values are:
> model.tables(SEAGRASS.mod, type = "means", se = TRUE,
+              cterms = c("spermidine", "salinity"))
Tables of means
Grand mean
0.7301755

 spermidine
spermidine
    NO    YES
0.7114 0.7489

 salinity
salinity
 05PSU  11PSU  18PSU  36PSU
0.6264 0.7157 0.7882 0.7904

Standard errors for differences of means
        spermidine salinity
           0.01303  0.01843
replic.         16        8
> model.tables(SEAGRASS.mod, type = "effects", se = TRUE,
+              cterms = c("spermidine", "salinity"))
Tables of effects

 spermidine
spermidine
       NO       YES
-0.018736  0.018736

 salinity
salinity
   05PSU    11PSU    18PSU    36PSU
-0.10377 -0.01448  0.05802  0.06024

Standard errors of effects
        spermidine salinity
          0.009214 0.013031
replic.         16        8
(i) A function is written to find λ for each value of σ and then return the power for $H_0: \beta_j = 0$ for all $j$ versus $H_1: \beta_j \neq 0$ for some $j$ assuming n = 4. > > > > > > > + + + + + + + + + + + >
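The function itself is not recoverable from this text, so the following is only a sketch of how such a power calculation could be written. It assumes the noncentrality parameter for the test of the column factor in a balanced a × b factorial with n replicates is λ = a n Σ βj² / σ², and it uses hypothetical cell means (rounded from the sample means in part (b)), since the assumed true means are not shown. The names power_salinity and cell.means are hypothetical.

```r
# Power of the F test for the column factor (salinity) in a balanced
# two-factor factorial design with n replicates per cell
power_salinity <- function(means, sigma, n = 4, alpha = 0.05) {
  a <- nrow(means)                       # levels of the row factor
  b <- ncol(means)                       # levels of the column factor
  beta <- colMeans(means) - mean(means)  # column effects
  lambda <- a * n * sum(beta^2) / sigma^2
  df1 <- b - 1
  df2 <- a * b * (n - 1)
  crit <- qf(1 - alpha, df1, df2)        # critical value under H0
  pf(crit, df1, df2, ncp = lambda, lower.tail = FALSE)
}

# Hypothetical cell means: rows are spermidine (NO, YES),
# columns are salinity (05PSU, 11PSU, 18PSU, 36PSU)
cell.means <- matrix(c(0.58, 0.71, 0.77, 0.78,
                       0.67, 0.72, 0.80, 0.80),
                     nrow = 2, byrow = TRUE)

# Power for several assumed values of sigma
sigmas <- c(0.05, 0.10, 0.20)
power  <- sapply(sigmas, function(s) power_salinity(cell.means, sigma = s))
data.frame(sigma = sigmas, power = power)
```

Power falls as σ grows because λ is inversely proportional to σ², and when all column effects are zero the function returns the significance level α.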
alpha > >
library(car) TR1 F) group 3 2.5623 0.06266 . 63 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s test does not reject the null hypothesis of constant variance at the α = 0.05 level for either stage (℘-value = 0.314) or defoliation (℘-value = 0.0627). The assumptions for Model (B) appear to be satisfied.
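As a rough illustration of what a Levene-type test computes, the sketch below implements the median-centered variant (the default centering of leveneTest() in car) with base R only: a one-way ANOVA on absolute deviations from the group medians. The helper name levene_median is hypothetical, and warpbreaks stands in for the SUNFLOWER data.

```r
# Median-centered Levene (Brown-Forsythe) test using base R only.
# warpbreaks is a stand-in data set, not the SUNFLOWER data.
levene_median <- function(y, group) {
  group <- as.factor(group)
  dev <- abs(y - ave(y, group, FUN = median))  # |y - group median|
  tab <- anova(lm(dev ~ group))                # one-way ANOVA on deviations
  c(F = tab[1, "F value"], p = tab[1, "Pr(>F)"])
}

res <- levene_median(warpbreaks$breaks, warpbreaks$tension)
res
# A p-value above alpha gives no evidence against constant variance.
```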
Model (C)
> modelC.aov <- aov(yield ~ stage + defoli, data = SUNFLOWER[-TooBig, ])
> TR <- summary(modelC.aov)
> TR
            Df   Sum Sq Mean Sq F value   Pr(>F)
stage        4 13997052 3499263   5.343 0.000968 ***
defoli       3 21878479 7292826  11.136 6.85e-06 ***
Residuals   59 38638596  654891
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(i) The three basic assumptions concerning the errors: independence, normal distribution, and constant variance are assessed with the checking.plots() function.
> checking.plots(modelC.aov)
[Figure: output of checking.plots(modelC.aov). Panels: standardized residuals versus ordered values; normal Q-Q plot of standardized residuals; standardized residuals versus fitted values; density plot of standardized residuals (N = 67, bandwidth = 0.3511). Observations 39 and 70 are labeled.]
The model appears adequate.
(ii) The model's effects are estimated with the function model.tables(). The values for the decomposition of the Yij are obtained using the function proj() and stored in the object EFF.
> model.tables(modelC.aov, type = "effects")
Tables of effects

 stage
    stage1 stage2 stage3 stage4 stage5
     181.1  1.982  666.9  70.38 -815.1
rep   12.0 16.000   11.0  15.00   13.0

 defoli
    control treat1 treat2 treat3
        347  549.2 -13.84 -950.1
rep      16   18.0  17.00   16.0
> EFF <- proj(modelC.aov)
> head(EFF)
  (Intercept)      stage   defoli  Residuals
1    1494.955 181.128109 366.5917  -375.6750
2    1494.955 181.128109 366.5917  1783.3250
3    1494.955 181.128109 366.5917   329.3250
4    1494.955   1.982276 366.5917   197.4708
5    1494.955   1.982276 366.5917   791.4708
6    1494.955   1.982276 366.5917 -1675.5292
> VAL[1:10]
   1    2    3    4    5    6    7    9   10   11
1667 3826 2372 2061 2655  188 2309 3352 2987 1685
> SUNFLOWER[-TooBig, ]$yield[1:10]
 [1] 1667 3826 2372 2061 2655  188 2309 3352 2987 1685
Note that the values stored in the object VAL are the same as the original yield values used to construct Model (C).
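The statement that the projections add back to the original responses can be checked on any additive aov fit. The sketch below is an assumption-laden illustration: it uses the built-in warpbreaks data rather than SUNFLOWER, and it rebuilds the responses with rowSums(), which may not be the call used in the manual.

```r
# Additive two-factor model on the built-in warpbreaks data
# (a stand-in for modelC.aov; object names are illustrative).
fit <- aov(breaks ~ wool + tension, data = warpbreaks)

EFF <- proj(fit)   # columns: (Intercept), wool, tension, Residuals
head(EFF)

# Each observation equals the sum of its projections onto the model terms.
VAL <- rowSums(EFF)
all.equal(as.numeric(VAL), as.numeric(warpbreaks$breaks))
```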
(iii)
> CI <- TukeyHSD(modelC.aov, which = "defoli")
> CI
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ stage + defoli, data = SUNFLOWER[-TooBig, ])

$defoli
                     diff        lwr       upr     p adj
treat1-control   202.2444  -532.8707  937.3595 0.8857552
treat2-control  -360.8110 -1106.0312  384.4093 0.5791269
treat3-control -1297.0928 -2053.5200 -540.6656 0.0001663
treat2-treat1   -563.0554 -1286.6335  160.5228 0.1792500
treat3-treat1  -1499.3372 -2234.4523 -764.2221 0.0000075
treat3-treat2   -936.2818 -1681.5021 -191.0616 0.0081636

> CI <- TukeyHSD(modelC.aov, which = "stage")
> CI
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = yield ~ stage + defoli, data = SUNFLOWER[-TooBig, ])

$stage
                     diff        lwr        upr     p adj
stage2-stage1  -179.14583 -1048.7577  690.46602 0.9775671
stage3-stage1   485.73485  -464.8130 1436.28274 0.6060833
stage4-stage1  -110.75000  -992.6974  771.19739 0.9965671
stage5-stage1  -996.23718 -1907.8381  -84.63626 0.0254664
stage3-stage2   664.88068  -227.0325 1556.79390 0.2348146
stage4-stage2    68.39583  -750.0167  886.80838 0.9993030
stage5-stage2  -817.09135 -1667.3761   33.19339 0.0653308
stage4-stage3  -596.48485 -1500.4293  307.45962 0.3517722
stage5-stage3 -1481.97203 -2414.8711 -549.07297 0.0003382
stage5-stage4  -885.48718 -1748.3838  -22.59057 0.0415604
> ggplot(data = ND, aes(x = stage, y = yield, fill = stage)) +
+ geom_boxplot() +
+ theme_bw() +
+ labs(y = "Sunflower Yield (kg/ha)", x = "") +
+ guides(fill = FALSE)
> library(plyr)
> ggplot(data = mdf, aes(x = stage, y = MeanStage, fill = stage)) +
+ geom_bar(stat = "identity") +
+ geom_errorbar(aes(ymin = MeanStage - SE, ymax = MeanStage + SE),
+ width = 0.30) +
+ guides(fill = FALSE) +
+ theme_bw() +
+ labs(y = "Sunflower Yield (kg/ha)", x = "")
[Figure: boxplots of sunflower yield (kg/ha) by stage (stage1 to stage5), and a bar chart of mean sunflower yield (kg/ha) by stage with error bars of plus or minus one SE.]
(j)
> levels(ND$defoli)
[1] "CoT1" "T2T3"
> ggplot(data = ND, aes(x = defoli, y = yield, fill = defoli)) +
+ geom_boxplot() +
+ theme_bw() +
+ labs(y = "Sunflower Yield (kg/ha)", x = "") +
+ guides(fill = FALSE)
> library(plyr)
> ggplot(data = mdf, aes(x = defoli, y = MeanStage, fill = defoli)) +
+ geom_bar(stat = "identity") +
+ geom_errorbar(aes(ymin = MeanStage - SE, ymax = MeanStage + SE),
+ width = 0.30) +
+ guides(fill = FALSE) +
+ theme_bw() +
+ labs(y = "Sunflower Yield (kg/ha)", x = "")
[Figure: boxplots of sunflower yield (kg/ha) for the combined defoliation groups CoT1 and T2T3, and a bar chart of mean sunflower yield (kg/ha) for the same groups with error bars of plus or minus one SE.]
Chapter 12 Regression
1. The manager of a URL commercial address is interested in predicting the number of megabytes downloaded, megasd, by clients according to the number of minutes they are connected, mconnected. The manager randomly selects (megabyte, minute) pairs, records the data, and stores the pairs (megasd, mconnected) in the file URLADDRESS.
(a) Create a scatterplot of the data. Characterize the relationship between megasd and mconnected.
(b) Fit a regression line to the data. Superimpose the resulting line in the plot created in part (a).
(c) Compute the covariance matrix of the β̂s.
(d) What is the standard error of β̂1?
(e) What is the covariance between β̂0 and β̂1?
(f) Construct a 95% confidence interval for the slope of the regression line.
(g) Compute R2, Ra2, and the residual variance for the fitted regression.
(h) What assumptions need to be satisfied in order to use the model from part (b) for inferential purposes?
(i) Are there any outlying observations?
(j) Are there any influential observations? Compute and graph Cook's distances, DFFITS, and DFBETAS to answer this question. Create a bubble plot of studentized residuals versus leverage values with plotted points proportional to Cook's distance using the function influencePlot() from the car package. Does the bubble plot confirm your answer with respect to influential observations?
(k) Estimate the mean value of megabytes downloaded by clients spending 5, 10, and 15 minutes on line. Construct the corresponding 90% confidence intervals.
(l) Predict the megabytes downloaded by a client spending 30 minutes on line. Construct the corresponding 90% prediction interval.
Solution:
(a)
> ggplot(data = URLADDRESS, aes(x = mconnected, y = megasd)) +
+ geom_point() +
+ theme_bw()
[Figure: scatterplot of megasd versus mconnected.]
Based on the graph, the relationship between megasd and mconnected is positive, linear, and strong.
(b)
> mod1a <- lm(megasd ~ mconnected, data = URLADDRESS)
> mod1a

Call:
lm(formula = megasd ~ mconnected, data = URLADDRESS)

Coefficients:
(Intercept)   mconnected
      6.189        9.831

> ggplot(data = URLADDRESS, aes(x = mconnected, y = megasd)) +
+ geom_point() +
+ theme_bw() +
+ geom_smooth(method = "lm")
[Figure: scatterplot of megasd versus mconnected with the least squares line and its confidence band superimposed.]
The least squares regression line is Ŷ = 6.189 + 9.8313x.
(c) The variance-covariance matrix of the β̂s is computed with the function vcov().
> vcov(mod1a)
            (Intercept)  mconnected
(Intercept)  10.8661271 -0.85425034
mconnected   -0.8542503  0.08931002
(d)
> seb1hat
[1] 0.2988478
> # or
> TR
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  6.188972  3.2963809  1.877505 7.090274e-02
mconnected   9.831263  0.2988478 32.897221 6.369479e-24
> seb1hat
[1] 0.2988478
The standard error of β̂1 is σ̂(β̂1) = s(β̂1) = 0.2988.
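The link between the variance-covariance matrix in part (c) and the standard error in part (d) can be illustrated as follows. This is a minimal sketch on the built-in cars data (dist regressed on speed), used here only as a stand-in for URLADDRESS; the object names are hypothetical.

```r
# Stand-in simple linear regression on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)

V <- vcov(fit)

# Standard errors are the square roots of the diagonal of vcov().
se_from_vcov <- sqrt(diag(V))
se_from_summary <- coef(summary(fit))[, "Std. Error"]
all.equal(se_from_vcov, se_from_summary)

# The off-diagonal element is the covariance between the two estimates.
cov_b0_b1 <- V["(Intercept)", "speed"]
cov_b0_b1
```

The covariance is negative whenever the mean of the predictor is positive, which matches the sign seen for mod1a.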
(e)
> vcov(mod1a)
            (Intercept)  mconnected
(Intercept)  10.8661271 -0.85425034
mconnected   -0.8542503  0.08931002
> covb0b1
[1] -0.8542503
The covariance between β̂0 and β̂1 is −0.8543.
(f)
> CI <- confint(mod1a)
> CI
                 2.5 %   97.5 %
(Intercept) -0.5633578 12.94130
mconnected   9.2191007 10.44342
The 95% confidence interval for the slope of the regression line from part (b) is CI 0.95 (β1) = [9.2191, 10.4434].
(g)
> summary(mod1a)

Call:
lm(formula = megasd ~ mconnected, data = URLADDRESS)

Residuals:
     Min       1Q   Median       3Q      Max
-17.4601  -5.0653   0.0563   5.0002  26.3222

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   6.1890     3.2964   1.878   0.0709 .
mconnected    9.8313     0.2988  32.897
> summary(mod1a)$r.squared
[1] 0.9747799
> summary(mod1a)$adj.r.squared
[1] 0.9738792
> summary(mod1a)$sigma
[1] 8.992034
> summary(mod1a)$sigma^2
[1] 80.85668
The R2, Ra2, and residual variance are 0.9748, 0.9739, and 80.8567, respectively.
(h) The errors are assumed to be independent, follow a normal distribution with mean zero, and have constant variance. Since the errors are unobservable, the residuals are analyzed.
> par(mfrow = c(1, 2))
> plot(mod1a, which = 1:2)
> par(mfrow = c(1, 1))
[Figure: diagnostic plots for mod1a. Panels: Residuals vs Fitted; Normal Q-Q. Observations 8, 13, and 22 are labeled.]
The normality assumption for the errors appears reasonable based on the graphs of the residuals.
(i)
> TR <- outlierTest(mod1a)
> TR
   rstudent unadjusted p-value Bonferonni p
13 3.536669          0.0014863      0.04459
Using the Bonferroni approach to test for outliers with the function outlierTest() from the car package, observation 13 is considered an outlier (℘-value = 0.0446) at the α = 0.05 level.
(j)
> n n [1] 30 > > + > > > > + > > > >
p + > > > >
# plot(dfbetas(mod1a)[,1], type = "h", ylim = c(-1, 1), ylab = "", main = substitute(paste("DFBETAS for ", hat(beta)[0]))) CV CV)
named integer(0) > > + > > > >
# plot(dfbetas(mod1a)[,2], type = "h", ylim = c(-1, 1), ylab = "", main = substitute(paste("DFBETAS for ", hat(beta)[1]))) CV CV)
22 22
[Figure: index plots for mod1a. Panels: Cook's Distance; DFFITS; DFBETAS for β̂0; DFBETAS for β̂1.]
Based on the graphs, DFFITS flags observations 13 and 22, while DFBETAS for β̂1 flags observation 22 for further scrutiny.
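The flagging step can be sketched with base R influence functions. The cutoffs used below are common rules of thumb and are an assumption here, since the critical values used in the manual are not recoverable from this excerpt; cars again stands in for URLADDRESS and all object names are hypothetical.

```r
# Stand-in model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
n <- nobs(fit)
p <- length(coef(fit))

cd  <- cooks.distance(fit)
dff <- dffits(fit)
dfb <- dfbetas(fit)

# Rule-of-thumb cutoffs (assumed, not taken from the manual):
#   Cook's D > median of F(p, n - p)
#   |DFFITS| > 2 * sqrt(p / n)
#   |DFBETAS| > 2 / sqrt(n)
flag_cook    <- which(cd > qf(0.5, p, n - p))
flag_dffits  <- which(abs(dff) > 2 * sqrt(p / n))
flag_dfbetas <- which(abs(dfb[, "speed"]) > 2 / sqrt(n))

flag_cook
flag_dffits
flag_dfbetas
```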
> influencePlot(mod1a)
      StudRes        Hat      CookD
13  3.5366686 0.03335346 0.39106806
16  0.2602743 0.13608497 0.07429144
22 -2.1953572 0.11099209 0.51453521
[Figure: bubble plot of studentized residuals versus hat-values for mod1a, with point size proportional to Cook's distance. Observations 13, 16, and 22 are labeled.]
The bubble plot also flags observations 13 and 22 for further study. Observations 13 and 22 are potentially influential observations; however, without further knowledge of the data, there is little one can do other than omit the values and see if the regression line changes significantly. The model refit without observations 13 and 22 is stored in mod1b.
> coef(summary(mod1a))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  6.188972  3.2963809  1.877505 7.090274e-02
mconnected   9.831263  0.2988478 32.897221 6.369479e-24
> coef(summary(mod1b))
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept)  4.231873  2.5714045  1.645744 1.118536e-01
mconnected  10.008235  0.2390842 41.860720 2.176567e-25
Observations 13 and 22 are marginally influential: the estimates of the intercept and slope, as well as their standard errors, change only marginally when observations 13 and 22 are removed.
(k)
> CI
        fit       lwr       upr
1  55.34529  51.71411  58.97646
2 104.50160 101.70009 107.30311
3 153.65791 149.72931 157.58652
The estimated mean values of megabytes downloaded by clients spending 5, 10, and 15 minutes on line are 55.3453, 104.5016, and 153.6579 megabytes, respectively. The individual 90% confidence intervals for clients spending 5, 10, and 15 minutes on line, respectively, are CI 0.90 [E(Yh)] = [51.7141, 58.9765], CI 0.90 [E(Yh)] = [101.7001, 107.3031], and CI 0.90 [E(Yh)] = [149.7293, 157.5865].
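Intervals of this kind are typically produced with predict(). The sketch below is an illustration under assumptions: it uses the built-in cars data in place of URLADDRESS, the predictor values are chosen only for demonstration, and it also rebuilds the confidence limits by hand as fit ± t × se(fit).

```r
# Stand-in model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = c(5, 10, 15))

# 90% confidence intervals for the mean response.
CI <- predict(fit, newdata = new, interval = "confidence", level = 0.90)
CI

# The same limits by hand: fit +/- t_{0.95, n - 2} * se(fit).
pr <- predict(fit, newdata = new, se.fit = TRUE)
tcrit <- qt(0.95, df = fit$df.residual)
manual <- cbind(fit = pr$fit,
                lwr = pr$fit - tcrit * pr$se.fit,
                upr = pr$fit + tcrit * pr$se.fit)

# 90% prediction interval for a single new observation.
PI <- predict(fit, newdata = data.frame(speed = 30),
              interval = "prediction", level = 0.90)
PI
```

A prediction interval is wider than the confidence interval at the same predictor value because it also accounts for the variance of the new error term.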
(l)
> PI
       fit      lwr      upr
1 301.1269 282.4263 319.8274
The predicted megabytes downloaded by a client spending 30 minutes on line is 301.1269. The 90% prediction interval for megabytes downloaded for a client spending 30 minutes on line is PI 0.90 [Yh(new)] = [282.4263, 319.8274].
2. A metallurgical company is investigating lost revenue due to worker illness. It is interested in creating a table of lost revenue to be used for future budgets and company forecasting plans. The data are stored in the data frame LOSTR.
(a) Create a scatterplot of lost revenue versus number of ill workers. Characterize the relationship between lostrevenue and numbersick.
(b) Fit a regression line to the data. Superimpose the resulting line in the plot created in part (a).
(c) Compute the covariance matrix of the β̂s.
(d) Create a 95% confidence interval for β1.
(e) Compute the coefficient of determination and the adjusted coefficient of determination. Provide contextual interpretations of both values.
(f) What assumptions need to be satisfied in order to use the model from part (b) for inferential purposes? If there is/are any outlier/s in the data, remove it/them prior to answering the remainder of the questions.
(g) Determine the expected lost revenues when 5, 15, and 25 workers are absent due to illness.
(h) Compute a 95% prediction interval of lost revenues when 14 workers are absent due to illness.
Solution:
(a)
> ggplot(data = LOSTR, aes(x = numbersick, y = lostrevenue)) +
+ geom_point() +
+ theme_bw()
[Figure: scatterplot of lostrevenue versus numbersick.]
Based on the graph, the relationship between lostrevenue and numbersick is positive, linear, and strong. There is one outlier.
(b)
> mod2b <- lm(lostrevenue ~ numbersick, data = LOSTR)
> mod2b

Call:
lm(formula = lostrevenue ~ numbersick, data = LOSTR)

Coefficients:
(Intercept)   numbersick
      294.8         96.9

> ggplot(data = LOSTR, aes(x = numbersick, y = lostrevenue)) +
+ geom_point() +
+ theme_bw() +
+ geom_smooth(method = "lm")
[Figure: scatterplot of lostrevenue versus numbersick with the least squares line and its confidence band superimposed.]
The least squares regression line is Ŷ = 294.8392 + 96.897x.
(c) The variance-covariance matrix of the β̂s is computed with the function vcov().
> vcov(mod2b)
            (Intercept) numbersick
(Intercept)  12724.0361 -741.70261
numbersick    -741.7026   53.28323
(d)
> CI <- confint(mod2b)
> CI
               2.5 %   97.5 %
(Intercept) 61.49279 528.1855
numbersick  81.79671 111.9972
The 95% confidence interval for the slope of the regression line from part (b) is CI 0.95 (β1) = [81.7967, 111.9972].
(e)
> summary(mod2b)$r.squared
[1] 0.8845437
> summary(mod2b)$adj.r.squared
[1] 0.8795239
The coefficient of determination and the adjusted coefficient of determination are 0.8845 and 0.8795, respectively. According to the linear model, 88.4544% of the variability in lost revenue is accounted for by variation in the number of sick employees.
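(f) To use the model from part (b) for inferential purposes, the errors should be independent, follow a normal distribution with mean zero, and have constant variance. Since the errors are unobservable, the residuals are examined.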
> par(mfrow = c(1, 2))
> plot(mod2b, which = 1:2)
> par(mfrow = c(1, 1))
> outlierTest(mod2b)
  rstudent unadjusted p-value Bonferonni p
4 42.11082         1.5825e-22   3.9562e-21
[Figure: diagnostic plots for mod2b. Panels: Residuals vs Fitted; Normal Q-Q. Observation 4 is labeled in both panels.]
Observation 4 is an outlier since the absolute value of its studentized residual is greater than 3.505 = t(1 − 0.05/(2n); 25 − 2 − 1), with n = 25.
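The critical value and the Bonferroni adjustment quoted above can be reproduced with qt() and pt(). This is a small sketch that only reuses the numbers printed in the solution (n = 25, two parameters, studentized residual 42.11082); the variable names are illustrative.

```r
n <- 25        # number of observations
p <- 2         # number of regression parameters
alpha <- 0.05

# Bonferroni critical value for the largest absolute studentized residual.
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)
crit

# Two-sided unadjusted and Bonferroni-adjusted p-values for observation 4.
t_obs <- 42.11082
p_unadj <- 2 * pt(-abs(t_obs), df = n - p - 1)
p_bonf <- min(1, n * p_unadj)
c(p_unadj, p_bonf)
```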
With observation 4 removed, the model is refit and stored in mod2bNO.
(g)
> CI
        fit       lwr       upr
1  703.8017  684.3172  723.2862
2 1703.7961 1691.9454 1715.6469
3 2703.7906 2681.5851 2725.9961
The expected lost revenues when 5, 15, and 25 workers are absent due to illness are $703.8017, $1703.7961, and $2703.7906, respectively.
(h)
> PI
       fit      lwr      upr
1 1603.797 1545.119 1662.474
The 95% prediction interval is PI 0.95 [Yh(new)] = [$1545.119, $1662.474] for loss in revenue with 14 workers absent.
3. To obtain a linear relationship between the employment (number of employed people = dependent variable) and the GDP (gross domestic product = response variable), a researcher has taken data from 12 regions. Use the following information to answer the questions:
\[
\sum_{i=1}^{12} x_i = 581, \quad \sum_{i=1}^{12} x_i^2 = 28507, \quad \sum_{i=1}^{12} x_i Y_i = 2630, \quad \sum_{i=1}^{12} Y_i = 53, \quad \sum_{i=1}^{12} Y_i^2 = 267
\]
Source       df   SS      MS   Fobs   ℘-value
Regression   *    *       *    *      *
Error        *    22.08   *    *      *
(a) Complete the ANOVA table.
(b) Decide if the regression is statistically significant.
(c) Compute and interpret the coefficient of determination.
(d) Calculate the model's residual variance.
(e) Write out the fitted regression line and construct a 90% confidence interval for the slope.
Solution:
(a)
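The ANOVA quantities follow directly from the five sums given in the exercise. The block below is a sketch of one way to carry out the arithmetic, not the code from the manual; all object names are hypothetical, and only the error sum of squares (22.08) is checked against the value printed in the table.

```r
# Summary statistics given in the exercise.
n <- 12
sum_x <- 581; sum_x2 <- 28507; sum_xy <- 2630
sum_y <- 53;  sum_y2 <- 267

# Corrected sums of squares and cross products.
Sxx <- sum_x2 - sum_x^2 / n
Sxy <- sum_xy - sum_x * sum_y / n
SST <- sum_y2 - sum_y^2 / n

# Least squares estimates.
b1 <- Sxy / Sxx
b0 <- sum_y / n - b1 * sum_x / n

# ANOVA decomposition.
SSR <- b1 * Sxy
SSE <- SST - SSR
MSR <- SSR / 1
MSE <- SSE / (n - 2)          # residual variance
Fobs <- MSR / MSE
pval <- pf(Fobs, 1, n - 2, lower.tail = FALSE)

R2 <- SSR / SST               # coefficient of determination

# 90% confidence interval for the slope.
ci_b1 <- b1 + c(-1, 1) * qt(0.95, n - 2) * sqrt(MSE / Sxx)

round(c(SSR = SSR, SSE = SSE, MSR = MSR, MSE = MSE, Fobs = Fobs), 4)
```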
DF > > + > > > > >
DF
> c(tobs, pvalue)
[1] 1.076022 0.150741
Do not reject H0.
H0 : β2 = −1 versus H1 : β2 < −1:
> c(tobs, pvalue)
[1] -4.1751357607  0.0005444476
Reject H0.
5. Given a simple linear regression model, show
(a) $\hat{\sigma}^2 = \dfrac{\sum_i \hat{\varepsilon}_i^2}{n-2}$ is an unbiased estimator of $\sigma^2$.
(b) The diagonal element of the hat matrix can be expressed as
\[
h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j}(x_j - \bar{x})^2},
\]
where $\mathbf{x}_i' = (1, x_i)$.
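Solution:
(a) It must be shown that $E\left[\dfrac{\sum_i \hat{\varepsilon}_i^2}{n-2}\right] = \sigma^2$. In a simple linear regression model, $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. Summing over all $i$ and dividing by $n$ yields
\[
\bar{Y} = \beta_0 + \beta_1 \bar{x} + \bar{\varepsilon}.
\]
Since $\hat{\varepsilon}_i = Y_i - \hat{Y}_i$ and $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$,
\[
\hat{\varepsilon}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i).
\]
The point $(\bar{x}, \bar{Y})$ is always on the simple linear regression line, so $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{x}$. Substituting for $\hat{\beta}_0$ in the expression for $\hat{\varepsilon}_i$ gives
\[
\hat{\varepsilon}_i = (Y_i - \bar{Y}) + \hat{\beta}_1(\bar{x} - x_i).
\]
Since $Y_i - \bar{Y} = \beta_1(x_i - \bar{x}) + (\varepsilon_i - \bar{\varepsilon})$,
\[
\hat{\varepsilon}_i = (\beta_1 - \hat{\beta}_1)(x_i - \bar{x}) + (\varepsilon_i - \bar{\varepsilon}). \tag{12.1}
\]
Squaring and summing both sides of (12.1) yields
\begin{align*}
\sum_{i=1}^{n} \hat{\varepsilon}_i^2
&= \sum_{i=1}^{n}\left[(\beta_1 - \hat{\beta}_1)^2 (x_i - \bar{x})^2 + 2(\beta_1 - \hat{\beta}_1)(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon}) + (\varepsilon_i - \bar{\varepsilon})^2\right] \\
&= \underbrace{(\beta_1 - \hat{\beta}_1)^2 \sum_{i=1}^{n}(x_i - \bar{x})^2}_{\text{First}}
 + \underbrace{2(\beta_1 - \hat{\beta}_1)\sum_{i=1}^{n}(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon})}_{\text{Middle}}
 + \underbrace{\sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})^2}_{\text{Last}}
\end{align*}
The expectations of the First, Middle, and Last expressions will be taken to ascertain $E\left[\sum_{i=1}^{n}\hat{\varepsilon}_i^2\right]$. Recall that
\[
\operatorname{Var}[\hat{\beta}_1] = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}. \tag{12.2}
\]
The expected value of First is
\begin{align*}
E\left[(\beta_1 - \hat{\beta}_1)^2 \sum_{i=1}^{n}(x_i - \bar{x})^2\right]
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot E\left[(\beta_1 - \hat{\beta}_1)^2\right] \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \operatorname{Var}[\hat{\beta}_1] \\
&= \sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\
&= \sigma^2
\end{align*}
The expected value of Middle is
\begin{align*}
E\left[2(\beta_1 - \hat{\beta}_1)\sum_{i=1}^{n}(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon})\right]
&= -2E\left[(\hat{\beta}_1 - \beta_1)\sum_{i=1}^{n}(x_i - \bar{x})(\varepsilon_i - \bar{\varepsilon})\right] \\
&= -2E\left[(\hat{\beta}_1 - \beta_1)\sum_{i=1}^{n}(x_i - \bar{x})\left(\hat{\varepsilon}_i - (\beta_1 - \hat{\beta}_1)(x_i - \bar{x})\right)\right] && \text{From (12.1)} \\
&= -2E\left[(\hat{\beta}_1 - \beta_1)\left(\sum_{i=1}^{n}\hat{\varepsilon}_i(x_i - \bar{x}) + (\hat{\beta}_1 - \beta_1)\sum_{i=1}^{n}(x_i - \bar{x})^2\right)\right]
\end{align*}
By property 3 of the fitted regression line, $\sum_{i=1}^{n} x_i \hat{\varepsilon}_i = 0$, and property 1 of the fitted regression line, $\sum_{i=1}^{n} \hat{\varepsilon}_i = 0$:
\begin{align*}
&= -2E\left[(\hat{\beta}_1 - \beta_1)^2 \sum_{i=1}^{n}(x_i - \bar{x})^2\right] \\
&= -2\sum_{i=1}^{n}(x_i - \bar{x})^2 \cdot \operatorname{Var}[\hat{\beta}_1] \\
&= -2\sigma^2 && \text{By (12.2)}
\end{align*}
Recall that $\varepsilon_i \sim N(0, \sigma^2)$. This means $E[\varepsilon_i^2] = E[(\varepsilon_i - 0)^2] = \operatorname{Var}[\varepsilon_i] = \sigma^2$. For simple linear regression, it is also true that the covariance of any two error terms is zero. This implies that $E[\varepsilon_i \varepsilon_j] = 0$ if $i \neq j$. The expected value of Last is
\begin{align*}
E\left[\sum_{i=1}^{n}(\varepsilon_i - \bar{\varepsilon})^2\right]
&= E\left[\sum_{i=1}^{n}\left(\varepsilon_i^2 - 2\varepsilon_i\bar{\varepsilon} + \bar{\varepsilon}^2\right)\right] \\
&= \sum_{i=1}^{n}E[\varepsilon_i^2] - 2\sum_{i=1}^{n}E[\varepsilon_i\bar{\varepsilon}] + nE[\bar{\varepsilon}^2] \\
&= n\sigma^2 - 2\sum_{i=1}^{n}E\left[\varepsilon_i \cdot \frac{\varepsilon_1 + \varepsilon_2 + \cdots + \varepsilon_n}{n}\right] + n\operatorname{Var}[\bar{\varepsilon}] \\
&= n\sigma^2 - 2\sigma^2 + n \cdot \frac{\sigma^2}{n} \\
&= n\sigma^2 - \sigma^2 = \sigma^2(n - 1)
\end{align*}
Combining the three pieces,
\begin{align*}
E\left[\sum_{i}\hat{\varepsilon}_i^2\right] &= E[\text{First} + \text{Middle} + \text{Last}] \\
&= \sigma^2 - 2\sigma^2 + \sigma^2(n - 1) = (n - 2)\sigma^2 \\
\Longrightarrow\quad E[\hat{\sigma}^2] &= E\left[\frac{\sum_i \hat{\varepsilon}_i^2}{n-2}\right] = \sigma^2 \quad \text{Check}
\end{align*}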
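(b)
\[
(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\begin{bmatrix} \sum_{j=1}^{n} x_j^2 & -\sum_{j=1}^{n} x_j \\ -\sum_{j=1}^{n} x_j & n \end{bmatrix}
\]
\begin{align*}
h_{ii} &= \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\begin{bmatrix} 1 & x_i \end{bmatrix}
\begin{bmatrix} \sum_{j=1}^{n} x_j^2 & -\sum_{j=1}^{n} x_j \\ -\sum_{j=1}^{n} x_j & n \end{bmatrix}
\begin{bmatrix} 1 \\ x_i \end{bmatrix} \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\begin{bmatrix} 1 & x_i \end{bmatrix}
\begin{bmatrix} \sum_{j=1}^{n} x_j^2 - x_i\sum_{j=1}^{n} x_j \\ -\sum_{j=1}^{n} x_j + n x_i \end{bmatrix} \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\sum_{j=1}^{n} x_j^2 - x_i\sum_{j=1}^{n} x_j + x_i\left(-\sum_{j=1}^{n} x_j + n x_i\right)\right] \\
&= \frac{1}{n\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\sum_{j=1}^{n} x_j^2 - 2x_i\sum_{j=1}^{n} x_j + n x_i^2\right]
\end{align*}
Recall that $\dfrac{\sum_{j=1}^{n} x_j^2}{n} - \bar{x}^2 = \dfrac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n}$ and $\sum_{j=1}^{n} x_j = n\bar{x}$.
\begin{align*}
h_{ii} &= \frac{1}{\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\frac{\sum_{j=1}^{n} x_j^2}{n} - \bar{x}^2 + \bar{x}^2 - 2x_i\bar{x} + x_i^2\right] && \text{Add } \bar{x}^2 - \bar{x}^2 \text{ for form.} \\
&= \frac{1}{\sum_{j=1}^{n}(x_j - \bar{x})^2}
\left[\frac{\sum_{j=1}^{n}(x_j - \bar{x})^2}{n} + (x_i - \bar{x})^2\right] \\
&= \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2} \quad \text{Check}
\end{align*}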
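The closed form for the leverage can also be checked numerically. The sketch below assumes the built-in cars data as a stand-in simple linear regression and compares the formula with the diagonal of the hat matrix and with hatvalues(); object names are illustrative.

```r
# Stand-in simple linear regression on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
x <- cars$speed
n <- length(x)

# Closed form: 1/n + (x_i - xbar)^2 / sum_j (x_j - xbar)^2
h_formula <- 1 / n + (x - mean(x))^2 / sum((x - mean(x))^2)

# Direct computation: diagonal of H = X (X'X)^{-1} X'
X <- model.matrix(fit)
h_matrix <- diag(X %*% solve(crossprod(X)) %*% t(X))

all.equal(unname(h_matrix), h_formula)
all.equal(unname(hatvalues(fit)), h_formula)
```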
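6. Show that (12.63) and (12.64) are algebraically equivalent:
\[
D_i = \frac{(\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})'(\mathbf{X}'\mathbf{X})(\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})}{p\hat{\sigma}^2} = \frac{r_i^2}{p} \cdot \frac{h_{ii}}{1 - h_{ii}}.
\]
Note: $r_i = \dfrac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$ and $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$.
HINT:
\[
\left(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}\right)^{-1} = (\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}. \tag{12.3}
\]
Solution:
Since $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$, it follows that $\hat{\boldsymbol\beta}_{(i)}$, where "(i)" means "without case i," is
\[
\hat{\boldsymbol\beta}_{(i)} = \left(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}\right)^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)}. \tag{12.4}
\]
Rewriting (12.4) using the HINT in (12.3) gives
\[
\hat{\boldsymbol\beta}_{(i)} = \left[(\mathbf{X}'\mathbf{X})^{-1} + \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}\right]\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} \tag{12.5}
\]
where $\mathbf{X}' = \left[\mathbf{X}_{(i)}' \;\; \mathbf{x}_i\right]$ and $\mathbf{Y}' = \left[\mathbf{Y}_{(i)}' \;\; Y_i\right]$. Using the expressions for $\mathbf{X}$ and $\mathbf{Y}$, $\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}$ can be rewritten as
\[
\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i Y_i \tag{12.6}
\]
Subtracting (12.6) from (12.5) gives
\begin{align*}
\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta}
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)}}{1 - h_{ii}} - (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i Y_i \\
&= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\left[\frac{\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)}}{1 - h_{ii}} - Y_i\right] \\
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} - (1 - h_{ii})Y_i\right]
\end{align*}
Substituting for $h_{ii} = \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i$ inside the brackets gives
\begin{align*}
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} - \left(1 - \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\right)Y_i\right] \\
&= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_{(i)}'\mathbf{Y}_{(i)} + \mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i Y_i - Y_i\right]
\end{align*}
Going back to the expression for $\hat{\boldsymbol\beta}$ from (12.6) gives
\[
= \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[\mathbf{x}_i'\hat{\boldsymbol\beta} - Y_i\right] = \frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}\left[-\hat{\varepsilon}_i\right]
\]
Keep this expression handy for problem 7:
\[
\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta} = \frac{-(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\hat{\varepsilon}_i}{1 - h_{ii}} \tag{12.7}
\]
$D_i$ can now be written
\begin{align*}
D_i &= \frac{(\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})'(\mathbf{X}'\mathbf{X})(\hat{\boldsymbol\beta}_{(i)} - \hat{\boldsymbol\beta})}{p\hat{\sigma}^2} \\
&= \frac{[-\hat{\varepsilon}_i]\dfrac{\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}}{1 - h_{ii}}(\mathbf{X}'\mathbf{X})\dfrac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i}{1 - h_{ii}}[-\hat{\varepsilon}_i]}{p\hat{\sigma}^2} \\
&= \frac{\hat{\varepsilon}_i\,\mathbf{x}_i'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\,\hat{\varepsilon}_i}{p\hat{\sigma}^2(1 - h_{ii})^2} \\
&= \frac{\hat{\varepsilon}_i\,h_{ii}\,\hat{\varepsilon}_i}{p\hat{\sigma}^2(1 - h_{ii})^2}
\end{align*}
Since $r_i = \dfrac{\hat{\varepsilon}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}$ and $h_{ii}$ is a scalar, $r_i^2 = \dfrac{\hat{\varepsilon}_i^2}{\hat{\sigma}^2(1 - h_{ii})}$, so
\[
D_i = \frac{r_i^2 h_{ii}}{p(1 - h_{ii})} \quad \text{Check}
\]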
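The equivalence can be confirmed numerically by refitting the model without each case. This is a minimal sketch that assumes the built-in cars data as a stand-in; the leave-one-out loop implements the defining quadratic form, and the closed form uses rstandard() and hatvalues().

```r
# Stand-in model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
p <- length(coef(fit))
h <- hatvalues(fit)
r <- rstandard(fit)        # e_i / (sigma_hat * sqrt(1 - h_ii))

# Closed form: r_i^2 * h_ii / (p * (1 - h_ii))
D_closed <- r^2 * h / (p * (1 - h))

# Definition: shift in beta-hat when case i is deleted.
X <- model.matrix(fit)
s2 <- summary(fit)$sigma^2
D_refit <- sapply(seq_len(nrow(cars)), function(i) {
  d <- coef(lm(dist ~ speed, data = cars[-i, ])) - coef(fit)
  drop(t(d) %*% crossprod(X) %*% d) / (p * s2)
})

all.equal(unname(D_closed), D_refit)
all.equal(unname(cooks.distance(fit)), D_refit)
```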
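7. Show that (12.65) and (12.66) are algebraically equivalent:
\[
\text{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} = r_i^{*}\sqrt{\frac{h_{ii}}{1 - h_{ii}}}.
\]
Solution:
Recall that
\[
r_i^{*} = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}}}.
\]
Also note that
\[
\hat{Y}_i - \hat{Y}_{i(i)} = \mathbf{x}_i'\hat{\boldsymbol\beta} - \mathbf{x}_i'\hat{\boldsymbol\beta}_{(i)}.
\]
The expression in (12.7) as well as that for $h_{ii}$ allow simplification to
\[
\hat{Y}_i - \hat{Y}_{i(i)} = \mathbf{x}_i'\,\frac{(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}_i\hat{\varepsilon}_i}{1 - h_{ii}} = \frac{h_{ii}\hat{\varepsilon}_i}{1 - h_{ii}}
\]
The original expression becomes
\begin{align*}
\text{DFFITS}_i &= \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} = \frac{h_{ii}\hat{\varepsilon}_i}{1 - h_{ii}} \cdot \frac{1}{\hat{\sigma}_{(i)}\sqrt{h_{ii}}} \\
&= \frac{\hat{\varepsilon}_i \cdot \sqrt{h_{ii}}}{\hat{\sigma}_{(i)}\sqrt{1 - h_{ii}} \cdot \sqrt{1 - h_{ii}}} \\
&= r_i^{*}\sqrt{\frac{h_{ii}}{1 - h_{ii}}} \quad \text{Check}
\end{align*}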
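A numerical check of this identity, again assuming the built-in cars data as a stand-in, compares the closed form with the leave-one-out definition and with dffits().

```r
# Stand-in model on the built-in cars data.
fit <- lm(dist ~ speed, data = cars)
h <- hatvalues(fit)

# Closed form: r_i^* * sqrt(h_ii / (1 - h_ii))
closed <- rstudent(fit) * sqrt(h / (1 - h))

# Definition: (yhat_i - yhat_i(i)) / (sigma_hat_(i) * sqrt(h_ii))
by_def <- sapply(seq_len(nrow(cars)), function(i) {
  fit_i <- lm(dist ~ speed, data = cars[-i, ])           # model without case i
  yhat_del <- predict(fit_i, newdata = cars[i, , drop = FALSE])
  (fitted(fit)[i] - yhat_del) / (summary(fit_i)$sigma * sqrt(h[i]))
})

all.equal(unname(closed), unname(by_def))
all.equal(unname(dffits(fit)), unname(by_def))
```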
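8. Show that the SSE in a linear model expressed in summation notation is equivalent to the SSE expressed in matrix notation:
\[
SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 = \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y}.
\]
Solution:
Note that $\hat{\boldsymbol\beta}' = \mathbf{Y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}$.
\begin{align*}
SSE = \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 &= \hat{\boldsymbol\varepsilon}'\hat{\boldsymbol\varepsilon} \\
&= (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta})'(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol\beta}) \\
&= \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y} - \mathbf{Y}'\mathbf{X}\hat{\boldsymbol\beta} + \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} \\
&= \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y} - \mathbf{Y}'\mathbf{X}\hat{\boldsymbol\beta} + \mathbf{Y}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\hat{\boldsymbol\beta} \\
&= \mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y} \quad \text{Check}
\end{align*}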
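9. Show that the SSR in a linear model expressed in summation notation is equivalent to the SSR expressed in matrix notation:
\[
SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 = \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\mathbf{Y}'\mathbf{J}\mathbf{Y}.
\]
Solution:
It is known that $SSR = SST - SSE$, so
\begin{align*}
SSR = \sum_{i=1}^{n}(\hat{Y}_i - \bar{Y})^2 &= \sum_{i=1}^{n}(Y_i - \bar{Y})^2 - \sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2 \\
&= \left[\sum_{i=1}^{n}Y_i^2 - \frac{\left(\sum_{i=1}^{n}Y_i\right)^2}{n}\right] - \left[\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y}\right] \\
&= \left[\mathbf{Y}'\mathbf{Y} - \frac{1}{n}\mathbf{Y}'\mathbf{J}\mathbf{Y}\right] - \left[\mathbf{Y}'\mathbf{Y} - \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y}\right] \\
&= \hat{\boldsymbol\beta}'\mathbf{X}'\mathbf{Y} - \frac{1}{n}\mathbf{Y}'\mathbf{J}\mathbf{Y} \quad \text{Check}
\end{align*}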
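The matrix expressions for SSE and SSR in problems 8 and 9 can be compared with their summation forms. The sketch below assumes the built-in mtcars data with an arbitrary two-predictor model as a stand-in; J is the n × n matrix of ones, and all object names are illustrative.

```r
# Stand-in multiple regression on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
Y <- matrix(mtcars$mpg, ncol = 1)
n <- nrow(X)

b <- solve(crossprod(X), crossprod(X, Y))   # beta-hat = (X'X)^{-1} X'Y
J <- matrix(1, n, n)                        # n x n matrix of ones
H <- X %*% solve(crossprod(X)) %*% t(X)     # hat matrix

# Matrix forms.
SSE_matrix <- drop(t(Y) %*% Y - t(b) %*% t(X) %*% Y)
SSR_matrix <- drop(t(b) %*% t(X) %*% Y - t(Y) %*% J %*% Y / n)
SSR_quad   <- drop(t(Y) %*% (H - J / n) %*% Y)

# Summation forms.
SSE_sum <- sum(residuals(fit)^2)
SSR_sum <- sum((fitted(fit) - mean(mtcars$mpg))^2)

c(SSE_matrix = SSE_matrix, SSE_sum = SSE_sum)
c(SSR_matrix = SSR_matrix, SSR_quad = SSR_quad, SSR_sum = SSR_sum)
```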
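10. Show that the trace of the hat matrix H is equal to p, the number of parameters (βs), in a multiple linear regression model.
Solution:
Note: Suppose a matrix $\mathbf{A}$ exists and is $n \times n$. Then, $\operatorname{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}$. For any matrices $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$,
\[
\operatorname{tr}(\mathbf{ABC}) = \operatorname{tr}(\mathbf{BCA}) = \operatorname{tr}(\mathbf{CAB}) \tag{12.8}
\]
when such products exist. The projection matrix is $\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$.
\[
\operatorname{tr}(\mathbf{H}) = \operatorname{tr}\left(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right) = \underbrace{\operatorname{tr}\left(\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\right)}_{\text{By (12.8)}} = \operatorname{tr}(\mathbf{I}_{p \times p}) = p
\]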
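A quick numerical illustration of this result, assuming the built-in mtcars data and an arbitrary three-predictor model as a stand-in:

```r
# Stand-in multiple regression on the built-in mtcars data.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
X <- model.matrix(fit)
p <- ncol(X)                              # number of parameters (betas)

H <- X %*% solve(crossprod(X)) %*% t(X)   # hat matrix
trace_H <- sum(diag(H))

c(trace = trace_H, p = p)
all.equal(sum(hatvalues(fit)), trace_H)
```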
11. The data frame HSWRESTLER contains information on nine variables for a group of 78 high school wrestlers that was collected by the human performance lab at Appalachian State University. The variables are age (in years), ht (height in inches), wt (weight in pounds), abs (abdominal skinfold measure), triceps (tricep skinfold measure), subscap (subscapular skinfold measure), hwfat (hydrostatic determination of fat), tanfat (Tanita determination of fat), and skfat (skinfold determination of fat). Use hwfat (Y), abs (x1), and triceps (x2) to verify empirically the value obtained for SSR(x2, x1) using quadratic forms.
Solution:
R Code 12.1
> mod1 <- lm(hwfat ~ abs + triceps, data = HSWRESTLER)
> anova(mod1)
Analysis of Variance Table

Response: hwfat
          Df Sum Sq Mean Sq F value    Pr(>F)
abs        1 5072.8  5072.8 541.365 < 2.2e-16 ***
triceps    1  242.2   242.2  25.844 2.639e-06 ***
Residuals 75  702.8     9.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> SSR
[1] 5315.008
The value obtained previously with the lm() and anova() functions is SSR = 5315.0081. Recall that SSR = β̂′X′Y − (1/n)Y′JY = Y′(H − (1/n)J)Y.
> dim(Y)
[1] 78  1
> dim(X)
[1] 78  3
> dim(H)
[1] 78 78
> dim(J)
[1] 78 78
> SSR
         [,1]
[1,] 5315.008
Using quadratic forms, SSR = 5315.0081, the same value computed with R Code 12.1.
12. The data frame KINDER contains the height in inches and weight in pounds of 20 children from a kindergarten class. Use all 20 observations and construct a regression model where the results are stored in the object mod by regressing weight on height.
(a) Create a scatterplot of weight versus height to verify a possible linear relationship between the two variables.
(b) Compute and display the hat values for mod in a graph. Use the graph to identify the two largest hat values. Superimpose a horizontal line at 2p/n. Remove the values that exceed 2p/n and regress weight on height, storing the results in an object named modk.
(c) Remove case 19 from the original data frame KINDER and regress weight on height, storing the results in modk19. Is the child with the largest hat value an influential observation if one considers the 19 observations without case 19 from the original data frame? Compute and consider Cook's Di, DFFITSi, and DFBETASk(i) in reaching a conclusion. Specifically, produce a graph showing hii, the differences β̂1(i) − β̂1, DFBETASk(i), studentized residuals, DFFITSi, Cook's Di, and a bubble plot of studentized residuals versus leverage values with plotted points proportional to Cook's distance, along with the corresponding values that flag observations for further scrutiny assuming α = 0.10. Hint: Use the functions fortify() from the ggplot2 package and lm.influence().
(d) Remove case 20 from the data frame KINDER and regress weight on height, storing the results in modk20. Is the child with the largest hat value an influential observation if one considers the 19 observations without case 20 from the original data frame? Compute and consider Cook's Di, DFFITSi, and DFBETASk(i) in reaching a conclusion. Specifically, produce a graph showing hii, the differences β̂1(i) − β̂1, DFBETASk(i), studentized residuals, DFFITSi, Cook's Di, and a bubble plot of studentized residuals versus leverage values with plotted points proportional to Cook's distance, along with the corresponding values that flag observations for further scrutiny assuming α = 0.10.
(e) Create a scatterplot showing all 20 children. Use a solid circle to identify case 19 and a solid triangle to identify case 20. Superimpose the lines for models mod (lty = 1), modk (lty = 2), modk19 (lty = 3), and modk20 (lty = 4).
Solution:
(a)
> ggplot(data = KINDER, aes(x = wt, y = ht)) +
+ geom_point() + theme_bw() +
+ labs(x = "Weight (pounds)", y = "Height (inches)")
[Figure: scatterplot of height (inches) versus weight (pounds) for the 20 children in KINDER.]
Based on the scatterplot, a linear relationship between ht and wt appears reasonable; however, two points will bear further scrutiny. (b) > > > > + + + > >
mod > > > > > + + + > + + + + > > + + +
modk19 +
> coef(summary(line2))
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) -5373.946 35455.1129 -0.1515704 8.814204e-01
area         3183.757   433.2997  7.3477006 1.641688e-06
> coef(summary(line3))
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 95394.348 20555.5010 4.640818 4.254153e-05
area         1766.622   231.2392 7.639803 4.051498e-09
(c)
> ggplot(data = VIT2005, aes(x = area, y = totalprice,
+ color = conservation1)) +
+ geom_point() +
+ geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) +
+ theme_bw()
> rm(VIT2005) # Clean up
[Figure: scatterplot of totalprice versus area, colored by conservation1 (levels A, B, and C), with a separate least squares line for each level.]
Case Study: Biomass Data and ideas for this case study come from (Goicoa et al., 2011). 14. To estimate the amount of carbon dioxide retained in a tree, its biomass needs to be known and multiplied by an expansion factor (there are several alternatives in the literature). To calculate the biomass, specific regression equations by species are frequently used. These regression equations, called allometric equations, estimate the biomass of the tree by means of some known characteristics, typically diameter and/or height of the stem and branches. The BIOMASS file contains data of 42 beeches (Fagus Sylvatica) from a forest of Navarra (Spain) in 2006, where • diameter: diameter of the stem in centimeters • height: height of the tree in meters • stemweight: weight of the stem in kilograms • aboveweight: aboveground weight in kilograms (a) Create a scatterplot of aboveweight versus diameter. Superimpose a regression line over the plot just created.
Is the relationship linear?
(b) Create a scatterplot of log(aboveweight) versus log(diameter). Is the relationship linear? Superimpose a regression line over the plot just created. (c) Fit the regression model log(aboveweight) = β0 + β1 log(diameter), and compute R2 , Ra2 , and the variance of the residuals. (d) Introduce log(height) as an explanatory variable and fit the model log(aboveweight) = β0 + β1 log(diameter) + β2 log(height). What is the effect of introducing log(height) in the model?
K14521_SM-Color_Cover.indd 580
30/06/15 11:52 am
Chapter 12:
Regression
575
(e) Complete the Analysis questions for the model in (d).
Analysis questions:
(1) Estimate the model’s parameters and their standard errors. Provide an interpretation for the model’s parameters.
(2) Compute the variance-covariance matrix of the $\hat{\beta}$s.
(3) Provide 95% confidence intervals for β1 and β2.
(4) Compute the R², Ra², and the residual variance.
(5) Construct a graph with the default diagnostics plots of R.
(6) Can homogeneity of variance be assumed?
(7) Do the residuals appear to follow a normal distribution?
(8) Are there any outliers in the data?
(9) Are there any influential observations in the data?
(f) Obtain predictions of the aboveground biomass of trees with diameters diameter = seq(12.5, 42.5, 5) and heights height = seq(10, 40, 5). Note that the weight predictions are obtained from back transforming the logarithm. The bias correction is obtained by means of the lognormal distribution: If $\hat{Y}_{pred}$ is the prediction, the corrected (back-transformed) prediction $\tilde{Y}_{pred}$ is given by
$\tilde{Y}_{pred} = \exp\left(\hat{Y}_{pred} + \hat{\sigma}^2/2\right)$,
where $\hat{\sigma}^2$ is the variance of the error term.
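The bias correction in (f) can be sketched in R as follows. This is only an illustration under stated assumptions: the data are simulated (the data frame sim, its coefficients, and the error standard deviation are hypothetical and are not the BIOMASS values), and the object names fit, newtrees, naive, and corrected are introduced here for the example.

```r
# Simulated allometric data (hypothetical; NOT the BIOMASS data set)
set.seed(1)
n <- 200
diameter <- runif(n, 10, 60)
height <- runif(n, 10, 40)
aboveweight <- exp(-2.5 + 2.2 * log(diameter) + 0.5 * log(height) +
                   rnorm(n, sd = 0.15))
sim <- data.frame(diameter, height, aboveweight)

# Log-log model with the same structure as the model in (d)
fit <- lm(log(aboveweight) ~ log(diameter) + log(height), data = sim)

# Trees at which predictions are wanted (values taken from the exercise)
newtrees <- data.frame(diameter = seq(12.5, 42.5, 5),
                       height = seq(10, 40, 5))

log_pred <- predict(fit, newdata = newtrees)   # prediction on the log scale
sigma2_hat <- summary(fit)$sigma^2             # estimated error variance

naive <- exp(log_pred)                         # plain back-transformation
corrected <- exp(log_pred + sigma2_hat / 2)    # lognormal bias correction

cbind(newtrees, naive, corrected)
```

Because the estimated error variance is positive, the corrected predictions are always slightly larger than the naive back-transformed ones.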
Solution:
(a)
> ggplot(data = BIOMASS, aes(x = diameter, y = aboveweight)) +
+   geom_point() +
+   stat_smooth(method = "lm", se = FALSE) +
+   theme_bw()
[Figure: scatterplot of aboveweight versus diameter with the fitted regression line.]
The association between aboveweight and diameter is positive, but the general form of the relationship is slightly curvilinear.
(b)
> ggplot(data = BIOMASS, aes(x = log(diameter), y = log(aboveweight))) +
+   geom_point() +
+   stat_smooth(method = "lm", se = FALSE) +
+   theme_bw()
[Figure: scatterplot of log(aboveweight) versus log(diameter) with the fitted regression line.]
(c)
> modlog <- lm(log(aboveweight) ~ log(diameter), data = BIOMASS)
> summary(modlog)

Call:
lm(formula = log(aboveweight) ~ log(diameter), data = BIOMASS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.48510 -0.12682  0.02701  0.10766  0.32104

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    -1.5015     0.1920  -7.822 1.38e-09 ***
log(diameter)   2.2806     0.0542  42.076  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1842 on 40 degrees of freedom
Multiple R-squared:  0.9779, Adjusted R-squared:  0.9774
F-statistic:  1770 on 1 and 40 DF,  p-value: < 2.2e-16
r2 > > >
r2 vcov(modlogH)
                (Intercept) log(diameter)  log(height)
(Intercept)    0.1022432727 -0.0006066342 -0.032216921
log(diameter) -0.0006066342  0.0024646465 -0.002596642
log(height)   -0.0322169214 -0.0025966423  0.013365634
(3)
> CI CI
                   2.5 %     97.5 %
(Intercept)   -3.4238270 -2.1302958
log(diameter)  2.0773698  2.2782036
log(height)    0.2953374  0.7630233
The 95% confidence interval for β1 is [2.0774, 2.2782], and the 95% confidence interval for β2 is [0.2953, 0.763].
(4)
> summary(modlogH)

Call:
lm(formula = log(aboveweight) ~ log(diameter) + log(height), data = BIOMASS)

Residuals:
     Min       1Q   Median       3Q      Max
-0.26519 -0.11243 -0.01637  0.07720  0.38024

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.77706    0.31976  -8.685 1.18e-10 ***
log(diameter)  2.17779    0.04965  43.867  < 2e-16 ***
log(height)    0.52918    0.11561   4.577 4.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1505 on 39 degrees of freedom
Multiple R-squared:  0.9856, Adjusted R-squared:  0.9849
F-statistic:  1337 on 2 and 39 DF,  p-value: < 2.2e-16
r2 hcv hcv
# 2*p/n # hii CV
[1] 0.1702128
> which(hatvalues(modelD) > hcv)
10 22 45
10 22 45
There are three observations (10, 22, and 45) with a leverage value that exceeds 0.1702.
(vii)
> outlierTest(modelD)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
    rstudent unadjusted p-value Bonferonni p
46 -3.030668          0.0041662      0.19581
There are no outliers according to a Bonferroni test.
(viii) > > + + + + + > > >
DF CI CI
                     2.5 %      97.5 %
(Intercept)  -2012.6283481 2879.394461
fruit            0.7542251    1.091433
smallareaR67 -2116.2680000 3432.813298
smallareaR68 -5746.3330265  334.455286
The 95% confidence interval for the fruit coefficient is [0.7542, 1.0914].
(h)
> coef(summary(modelD))
                  Estimate   Std. Error    t value     Pr(>|t|)
(Intercept)    433.3830564 1.212883e+03  0.3573165 7.226027e-01
fruit            0.9228291 8.360421e-02 11.0380687 3.960570e-14
smallareaR67   658.2726489 1.375788e+03  0.4784696 6.347399e-01
smallareaR68 -2705.9388703 1.507614e+03 -1.7948481 7.970919e-02
> IOF IOF
[1] 9228.291
Holding all other quantities in the model constant, an increase in fruit of 10,000 m² would increase the expected value of observed by 9228.2905 m².
(i)
> newdata PredictEstimate PredictEstimate
         1          2          3
  89988.66 4503208.64 2658694.17
The predicted areas of fruit trees for small areas R63, R67, and R68 are 89988.6641 m², 4503208.6404 m², and 2658694.1669 m², respectively.
(j) The function ggplot() is used with the aesthetic color = smallarea to distinguish the small areas in the plot. The regression lines are nearly parallel, suggesting that a model with an identical slope but different intercepts for each small area may be reasonable.
> ggplot(data = SATFRUIT, aes(x = fruit, y = observed, color = smallarea)) +
+   geom_point() +
+   stat_smooth(method = "lm", se = FALSE) +
+   theme_bw()
[Figure: scatterplot of observed versus fruit, colored by smallarea (R63, R67, R68), with a fitted regression line for each small area.]
(k)
> ggplot(data = DF, aes(x = observed, y = .fitted, color = smallarea)) +
+   geom_point() +
+   theme_bw() +
+   geom_abline(intercept = 0, slope = 1, lty = "dashed") +
+   labs(x = "Observed", y = "Fitted")
[Figure: fitted versus observed values, colored by smallarea (R63, R67, R68), with the dashed reference line of slope 1 through the origin.]
A straight line appears to model the relationship between fitted and observed values.
(l) Recall that the direct technique estimates the total surface area by multiplying the mean of the observed surface area in the sampled segments by the total number of segments in every small area. The direct and model estimates, initially in m², are converted to hectares by dividing each estimate by 10,000.
> DirectEstimate DirectEstimate
      R63       R67       R68
 198466.5 5867470.0 3589159.5
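The direct technique described in (l) can be illustrated with a small sketch. All numbers here are hypothetical: the data frame sampled, the area labels A and B, and the vector total_segments are made up for the example and are not taken from SATFRUIT.

```r
# Hypothetical sample: observed fruit-tree area (m^2) in the sampled segments
sampled <- data.frame(
  smallarea = c("A", "A", "A", "B", "B"),
  observed  = c(1200, 800, 1000, 5000, 7000)
)

# Hypothetical total number of segments in each small area
total_segments <- c(A = 50, B = 120)

# Mean observed area per sampled segment, by small area
mean_obs <- tapply(sampled$observed, sampled$smallarea, mean)

# Direct estimate: sample mean times the number of segments in the area
direct_m2 <- mean_obs * total_segments[names(mean_obs)]

# Convert from square meters to hectares
direct_ha <- direct_m2 / 10000
direct_ha
```

Indexing total_segments by names(mean_obs) keeps the two vectors aligned even if the small areas are stored in a different order.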
606 > + > > > >
newdata > >
mod.glm > >
ggplot(data = VIT2005, aes(x = totalprice)) + geom_density(fill = "pink") + theme_bw() MD scatterplotMatrix(~totalprice + toilets + garage + elevator + + storage, data = VIT2005)
[Figure: scatterplot matrices of totalprice with area, age, floor, and rooms, and of totalprice with toilets, garage, elevator, and storage.]
The variable totalprice appears to have a moderate linear relationship with area.
(c)
> NUM COR COR
          area        age      floor    rooms   toilets    garage
[1,] 0.8092125 -0.2724497 0.02921993 0.525627 0.6875706 0.5237425
      elevator   storage
[1,] 0.5109393 0.2673579
The highest three correlations with totalprice occur with area (0.8092), toilets (0.6876), and rooms (0.5256).
Model (A)
The functions drop1() and update() are used to create a model using backward elimination.
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + age + floor + rooms + out +
    conservation + toilets + garage + elevator + streetcategory +
    heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.5891e+10 4412.6
area            1 4.4519e+10 1.3041e+11 4501.7 87.5972 < 2.2e-16 ***
zone           22 1.1171e+11 1.9760e+11 4550.3  9.9910 < 2.2e-16 ***
category        6 9.5199e+09 9.5411e+10 4423.5  3.1219  0.006303 **
age             1 3.7563e+06 8.5895e+10 4410.6  0.0074  0.931591
floor           1 3.8440e+07 8.5929e+10 4410.7  0.0756  0.783639
rooms           1 1.0656e+09 8.6956e+10 4413.3  2.0967  0.149472
out             3 3.8946e+09 8.9785e+10 4416.3  2.5544  0.057135 .
conservation    3 1.0031e+09 8.6894e+10 4409.2  0.6579  0.579069
toilets         1 4.7971e+09 9.0688e+10 4422.5  9.4389  0.002477 **
garage          1 1.4771e+10 1.0066e+11 4445.2 29.0627 2.328e-07 ***
elevator        1 5.4265e+09 9.1317e+10 4424.0 10.6772  0.001314 **
streetcategory  3 3.5550e+09 8.9446e+10 4415.5  2.3316  0.076019 .
heating         3 4.3202e+09 9.0211e+10 4417.3  2.8335  0.039877 *
storage         1 1.6433e+09 8.7534e+10 4414.8  3.2334  0.073937 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + floor + rooms + out + conservation +
    toilets + garage + elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.5895e+10 4410.6
area            1 4.4894e+10 1.3079e+11 4500.3 88.8523 < 2.2e-16 ***
zone           22 1.1220e+11 1.9810e+11 4548.8 10.0942 < 2.2e-16 ***
category        6 9.6605e+09 9.5555e+10 4421.9  3.1866 0.0054598 **
floor           1 3.6864e+07 8.5931e+10 4408.7  0.0730 0.7874016
rooms           1 1.0655e+09 8.6960e+10 4411.3  2.1088 0.1482969
out             3 4.0439e+09 8.9938e+10 4414.7  2.6678 0.0493562 *
conservation    3 1.3497e+09 8.7244e+10 4408.0  0.8905 0.4473466
toilets         1 4.7993e+09 9.0694e+10 4420.5  9.4987 0.0023994 **
garage          1 1.4767e+10 1.0066e+11 4443.2 29.2260 2.153e-07 ***
elevator        1 5.6740e+09 9.1569e+10 4422.6 11.2298 0.0009919 ***
streetcategory  3 3.5550e+09 8.9450e+10 4413.5  2.3453 0.0746761 .
heating         3 4.3716e+09 9.0266e+10 4415.5  2.8840 0.0373395 *
storage         1 1.6950e+09 8.7590e+10 4412.9  3.3547 0.0687625 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + rooms + out + conservation +
    toilets + garage + elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.5931e+10 4408.7
area            1 4.4857e+10 1.3079e+11 4498.3 89.2643 < 2.2e-16 ***
zone           22 1.1713e+11 2.0306e+11 4552.2 10.5949 < 2.2e-16 ***
category        6 9.8951e+09 9.5826e+10 4420.5  3.2818 0.0044204 **
rooms           1 1.0703e+09 8.7002e+10 4409.4  2.1299 0.1462887
out             3 4.0364e+09 8.9968e+10 4412.7  2.6774 0.0487312 *
conservation    3 1.3563e+09 8.7288e+10 4406.1  0.8997 0.4426615
toilets         1 4.8110e+09 9.0742e+10 4418.6  9.5738 0.0023062 **
garage          1 1.4733e+10 1.0066e+11 4441.2 29.3185 2.054e-07 ***
elevator        1 5.7376e+09 9.1669e+10 4420.8 11.4175 0.0009011 ***
streetcategory  3 3.5188e+09 8.9450e+10 4411.5  2.3341 0.0757310 .
heating         3 4.4146e+09 9.0346e+10 4413.6  2.9283 0.0352446 *
storage         1 1.6588e+09 8.7590e+10 4410.9  3.3010 0.0709881 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + rooms + out + toilets + garage +
    elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       8.7288e+10 4406.1
area            1 4.4067e+10 1.3135e+11 4493.2 87.8431 < 2.2e-16 ***
zone           22 1.1785e+11 2.0514e+11 4548.4 10.6787 < 2.2e-16 ***
category        6 1.2678e+10 9.9966e+10 4423.7  4.2122 0.0005529 ***
rooms           1 1.0246e+09 8.8312e+10 4406.7  2.0425 0.1547526
out             3 4.5287e+09 9.1816e+10 4411.2  3.0092 0.0316923 *
toilets         1 5.1432e+09 9.2431e+10 4416.6 10.2525 0.0016231 **
garage          1 1.5621e+10 1.0291e+11 4440.0 31.1392  9.06e-08 ***
elevator        1 5.7882e+09 9.3076e+10 4418.1 11.5382 0.0008449 ***
streetcategory  3 3.5484e+09 9.0836e+10 4408.8  2.3578 0.0734020 .
heating         3 3.9987e+09 9.1286e+10 4409.9  2.6570 0.0499673 *
storage         1 1.6695e+09 8.8957e+10 4408.3  3.3280 0.0698262 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating + storage
               Df  Sum of Sq        RSS    AIC  F value    Pr(>F)
<none>                       8.8312e+10 4406.7
area            1 6.3300e+10 1.5161e+11 4522.5 125.4358 < 2.2e-16 ***
zone           22 1.1695e+11 2.0526e+11 4546.5  10.5341 < 2.2e-16 ***
category        6 1.2113e+10 1.0043e+11 4422.7   4.0004 0.0008860 ***
out             3 4.8644e+09 9.3177e+10 4412.4   3.2131 0.0243109 *
toilets         1 5.4584e+09 9.3771e+10 4417.8  10.8163 0.0012163 **
garage          1 1.5751e+10 1.0406e+11 4440.5  31.2116 8.718e-08 ***
elevator        1 6.3078e+09 9.4620e+10 4419.7  12.4996 0.0005209 ***
streetcategory  3 3.3915e+09 9.1704e+10 4408.9   2.2402 0.0852905 .
heating         3 4.1435e+09 9.2456e+10 4410.7   2.7369 0.0450560 *
storage         1 1.5008e+09 8.9813e+10 4408.4   2.9740 0.0863783 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
totalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating
               Df  Sum of Sq        RSS    AIC  F value    Pr(>F)
<none>                       8.9813e+10 4408.4
area            1 6.4458e+10 1.5427e+11 4524.3 126.3141 < 2.2e-16 ***
zone           22 1.1977e+11 2.0959e+11 4549.1  10.6688 < 2.2e-16 ***
category        6 1.1488e+10 1.0130e+11 4422.6   3.7521  0.001540 **
out             3 4.9656e+09 9.4779e+10 4414.1   3.2436  0.023353 *
toilets         1 5.6974e+09 9.5511e+10 4419.8  11.1647  0.001018 **
garage          1 1.6124e+10 1.0594e+11 4442.4  31.5962  7.32e-08 ***
elevator        1 6.8119e+09 9.6625e+10 4422.3  13.3488  0.000341 ***
streetcategory  3 4.5260e+09 9.4339e+10 4413.1   2.9564  0.033902 *
heating         3 4.2935e+09 9.4107e+10 4412.5   2.8046  0.041267 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
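The manual sequence of drop1() and update() calls for Model (A) removes, at each step, the term with the largest p-value until every remaining term is significant at the 0.05 level. A minimal sketch of how that loop might be automated is given here. It is not the code used in this solution: the helper backward_F and its alpha argument are hypothetical, and the built-in mtcars data stand in for VIT2005.

```r
# Sketch: automate backward elimination with drop1() and update().
# backward_F() is a hypothetical helper; alpha = 0.05 is an assumed cutoff.
backward_F <- function(model, alpha = 0.05) {
  repeat {
    # Stop if only the intercept is left
    if (length(attr(terms(model), "term.labels")) == 0) break
    tab <- drop1(model, test = "F")
    pvals <- tab[["Pr(>F)"]][-1]        # drop the <none> row
    candidates <- rownames(tab)[-1]
    if (max(pvals) <= alpha) break      # every remaining term is significant
    worst <- candidates[which.max(pvals)]
    # Remove the least significant term and refit
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Illustration on built-in data (not VIT2005)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- backward_F(full)
formula(reduced)
```

Working through the steps by hand, as done above, has the advantage that each intermediate table can be inspected before a term is removed.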
> formula(model.be)
totalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating
> modelA > > >
modelAg >
set.seed(5) cv.error5 >
mgof >
library(MASS) SCOPE + > > >
modelC SCOPE mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ 1
               Df  Sum of Sq        RSS    AIC  F value    Pr(>F)
<none>                       1.0421e+12 4860.7
area            1 6.8239e+11 3.5970e+11 4630.8 409.7694 < 2.2e-16 ***
zone           22 6.2786e+11 4.1424e+11 4703.6  13.4346 < 2.2e-16 ***
category        6 3.8142e+11 6.6068e+11 4773.4  20.3020 < 2.2e-16 ***
age             1 7.7353e+10 9.6474e+11 4845.9  17.3190 4.563e-05 ***
floor           1 8.8974e+08 1.0412e+12 4862.5   0.1846 0.6678953
rooms           1 2.8791e+11 7.5418e+11 4792.2  82.4595 < 2.2e-16 ***
out             3 1.5831e+10 1.0263e+12 4863.4   1.1004 0.3499449
conservation    3 9.4240e+10 9.4785e+11 4846.1   7.0923 0.0001449 ***
toilets         1 4.9265e+11 5.4944e+11 4723.2 193.6754 < 2.2e-16 ***
garage          1 2.8585e+11 7.5624e+11 4792.8  81.6462 < 2.2e-16 ***
elevator        1 2.7205e+11 7.7005e+11 4796.8  76.3102 6.764e-16 ***
streetcategory  3 1.2246e+11 9.1963e+11 4839.5   9.4988 6.440e-06 ***
heating         3 1.6150e+11 8.8059e+11 4830.0  13.0827 7.067e-08 ***
storage         1 7.4489e+10 9.6760e+11 4846.6  16.6283 6.393e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       3.5970e+11 4630.8
zone           22 1.7960e+11 1.8010e+11 4524.0  8.7936 < 2.2e-16 ***
category        6 9.3970e+10 2.6573e+11 4576.8 12.3768 6.363e-12 ***
age             1 5.5354e+10 3.0435e+11 4596.4 39.1032 2.137e-09 ***
floor           1 1.4251e+09 3.5828e+11 4632.0  0.8552 0.3561201
rooms           1 1.4929e+08 3.5956e+11 4632.8  0.0893 0.7653963
out             3 5.1247e+09 3.5458e+11 4633.7  1.0261 0.3819282
conservation    3 3.6835e+10 3.2287e+11 4613.3  8.1001 3.915e-05 ***
toilets         1 5.6364e+10 3.0334e+11 4595.7 39.9496 1.482e-09 ***
garage          1 6.8073e+10 2.9163e+11 4587.1 50.1857 1.968e-11 ***
elevator        1 4.5308e+10 3.1440e+11 4603.5 30.9839 7.706e-08 ***
streetcategory  3 1.5886e+09 3.5812e+11 4635.9  0.3149 0.8145660
heating         3 2.5654e+10 3.3405e+11 4620.7  5.4526 0.0012481 **
storage         1 2.2452e+10 3.3725e+11 4618.8 14.3132 0.0002007 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       1.8010e+11 4524.0
category        6 3.7898e+10 1.4221e+11 4484.5  8.3504 4.949e-08 ***
age             1 1.1794e+10 1.6831e+11 4511.3 13.5237 0.0003053 ***
floor           1 2.1117e+08 1.7989e+11 4525.8  0.2266 0.6346290
rooms           1 2.4153e+09 1.7769e+11 4523.1  2.6235 0.1069270
out             3 1.0513e+09 1.7905e+11 4528.8  0.3738 0.7719787
conservation    3 1.5083e+10 1.6502e+11 4511.0  5.8192 0.0007961 ***
toilets         1 2.8406e+10 1.5170e+11 4488.6 36.1406 9.014e-09 ***
garage          1 3.2856e+10 1.4725e+11 4482.1 43.0642 4.758e-10 ***
elevator        1 2.0964e+10 1.5914e+11 4499.1 25.4251 1.056e-06 ***
streetcategory  3 7.0112e+09 1.7309e+11 4521.4  2.5789 0.0549525 .
heating         3 1.2792e+10 1.6731e+11 4514.0  4.8678 0.0027611 **
storage         1 7.0505e+09 1.7305e+11 4517.3  7.8632 0.0055610 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       1.4725e+11 4482.1
category        6 2.6933e+10 1.2032e+11 4450.1  6.9769 1.040e-06 ***
age             1 8.0663e+09 1.3918e+11 4471.9 11.1273 0.0010212 **
floor           1 2.8860e+08 1.4696e+11 4483.7  0.3771 0.5399126
rooms           1 2.2030e+09 1.4505e+11 4480.8  2.9162 0.0893099 .
out             3 2.0387e+09 1.4521e+11 4485.1  0.8892 0.4477641
conservation    3 9.5253e+09 1.3772e+11 4473.6  4.3803 0.0052350 **
toilets         1 1.6516e+10 1.3073e+11 4458.2 24.2567 1.813e-06 ***
elevator        1 1.6358e+10 1.3089e+11 4458.5 23.9950 2.046e-06 ***
streetcategory  3 6.5779e+09 1.4067e+11 4478.2  2.9616 0.0334679 *
heating         3 1.2552e+10 1.3470e+11 4468.7  5.9016 0.0007162 ***
storage         1 5.1715e+09 1.4208e+11 4476.3  6.9886 0.0088808 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category
               Df  Sum of Sq        RSS    AIC F value    Pr(>F)
<none>                       1.2032e+11 4450.1
age             1 1.7261e+09 1.1859e+11 4448.9  2.7073  0.101580
floor           1 9.2505e+05 1.2031e+11 4452.1  0.0014  0.969875
rooms           1 2.5576e+09 1.1776e+11 4447.4  4.0398  0.045883 *
out             3 5.6366e+09 1.1468e+11 4445.6  3.0146  0.031318 *
conservation    3 1.5615e+09 1.1875e+11 4453.2  0.8065  0.491749
toilets         1 6.6314e+09 1.1368e+11 4439.7 10.8497  0.001183 **
elevator        1 1.0192e+10 1.1012e+11 4432.8 17.2148 5.074e-05 ***
streetcategory  3 5.2963e+09 1.1502e+11 4446.3  2.8242  0.040097 *
heating         3 7.3042e+09 1.1301e+11 4442.4  3.9641  0.009074 **
storage         1 4.2321e+09 1.1608e+11 4444.3  6.7812  0.009957 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator
               Df  Sum of Sq        RSS    AIC F value   Pr(>F)
<none>                       1.1012e+11 4432.8
age             1  815586409 1.0931e+11 4433.2  1.3804 0.241550
floor           1   27424777 1.1010e+11 4434.7  0.0461 0.830261
rooms           1 1548114686 1.0857e+11 4431.7  2.6378 0.106049
out             3 4328922592 1.0579e+11 4430.1  2.4960 0.061298 .
conservation    3 1689461578 1.0843e+11 4435.4  0.9504 0.417449
toilets         1 6073243761 1.0405e+11 4422.4 10.7982 0.001216 **
streetcategory  3 5617271088 1.0451e+11 4427.4  3.2788 0.022218 *
heating         3 5056597855 1.0507e+11 4428.6  2.9358 0.034710 *
storage         1 3694847210 1.0643e+11 4427.4  6.4226 0.012097 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator + toilets
               Df  Sum of Sq        RSS    AIC F value  Pr(>F)
<none>                       1.0405e+11 4422.4
age             1  822533955 1.0323e+11 4422.7  1.4661 0.22751
floor           1   29736792 1.0402e+11 4424.4  0.0526 0.81885
rooms           1 1211766335 1.0284e+11 4421.9  2.1681 0.14261
out             3 4583711290 9.9466e+10 4418.6  2.7957 0.04164 *
conservation    3 1482925524 1.0257e+11 4425.3  0.8771 0.45405
streetcategory  3 5418285673 9.8631e+10 4416.8  3.3327 0.02072 *
heating         3 4760850186 9.9289e+10 4418.2  2.9089 0.03596 *
storage         1 3309776001 1.0074e+11 4417.4  6.0453 0.01487 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator + toilets +
    storage
               Df  Sum of Sq        RSS    AIC F value  Pr(>F)
<none>                       1.0074e+11 4417.4
age             1  372189171 1.0037e+11 4418.6  0.6786 0.41114
floor           1   21742002 1.0072e+11 4419.3  0.0395 0.84267
rooms           1 1463125127 9.9277e+10 4416.2  2.6970 0.10225
out             3 4307912248 9.6432e+10 4413.9  2.6953 0.04744 *
conservation    3 1367752167 9.9372e+10 4420.4  0.8304 0.47872
streetcategory  3 3975957762 9.6764e+10 4414.6  2.4791 0.06269 .
heating         3 4211586990 9.6528e+10 4414.1  2.6324 0.05145 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
totalprice ~ area + zone + garage + category + elevator + toilets +
    storage + out
               Df  Sum of Sq        RSS    AIC F value  Pr(>F)
<none>                       9.6432e+10 4413.9
age             1   13949063 9.6418e+10 4415.8  0.0260 0.87198
floor           1   15157903 9.6417e+10 4415.8  0.0283 0.86660
rooms           1 1036260293 9.5396e+10 4413.5  1.9553 0.16374
conservation    3  866974136 9.5565e+10 4417.9  0.5383 0.65666
streetcategory  3 3976103810 9.2456e+10 4410.7  2.5517 0.05715 .
heating         3 4728065085 9.1704e+10 4408.9  3.0591 0.02964 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> formula(mod.fs)
totalprice ~ area + zone + garage + category + elevator + toilets +
    storage + out
> modelD + > + > > >
modelD
set.seed(5) cv.error5 > > >
> residualPlot(modelA, main = "Model A")
> residualPlot(modelB, main = "Model B")
> residualPlot(modelC, main = "Model C")
> residualPlot(modelD, main = "Model D")
[Figure: Pearson residuals versus fitted values for Model A, Model B, Model C, and Model D.]
The plots of the residuals versus the fitted values for Models (A), (B), (C), and (D) all show a definite curvature, indicating that none of the models is adequate.
(e)
> boxCox(modelA, lambda = seq(-0.5, 0.5, length = 200))
> boxCox(modelB, lambda = seq(-0.5, 0.5, length = 200))
> boxCox(modelC, lambda = seq(-0.5, 0.5, length = 200))
> boxCox(modelD, lambda = seq(-0.5, 0.5, length = 200))
[Figure: Box-Cox profile log-likelihood versus λ, with 95% confidence intervals, for Models (A), (B), (C), and (D).]
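The value of λ that maximizes the Box-Cox profile log-likelihood can also be read off numerically. The sketch below is only an illustration: it uses boxcox() from the MASS package rather than boxCox() from car, and it is run on simulated data (x, y, and fit are hypothetical) whose logarithm is linear in the predictor by construction, so a value of λ near 0 is expected.

```r
library(MASS)   # boxcox(); MASS is a recommended package shipped with R

# Simulated data (hypothetical): log(y) is linear in x by construction
set.seed(2)
x <- runif(150, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(150, sd = 0.2))
fit <- lm(y ~ x)

# Profile log-likelihood over the same grid of lambda values used above
bc <- boxcox(fit, lambda = seq(-0.5, 0.5, length.out = 200), plotit = FALSE)

# Grid value of lambda with the largest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

When the 95% interval for λ contains 0, as in the plots for the four models, the log transformation is the conventional choice.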
A log transformation is suggested for the response totalprice in each model.
Model (E)
The functions drop1() and update() are used to create a model using backward elimination.
> VIT2005$logtotalprice model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + age + floor + rooms +
    out + conservation + toilets + garage + elevator + streetcategory +
    heating + storage
               Df Sum of Sq     RSS      AIC F value    Pr(>F)
<none>                      0.97895 -1080.46
area            1   0.45808 1.43703  -998.78 79.0794 8.788e-16 ***
zone           22   1.25852 2.23747  -944.25  9.8756 < 2.2e-16 ***
category        6   0.10125 1.08020 -1071.00  2.9132  0.009938 **
age             1   0.00061 0.97956 -1082.32  0.1045  0.746909
floor           1   0.00185 0.98080 -1082.05  0.3191  0.572909
rooms           1   0.00422 0.98317 -1081.52  0.7279  0.394766
out             3   0.06721 1.04616 -1071.98  3.8675  0.010426 *
conservation    3   0.01134 0.99029 -1083.95  0.6523  0.582546
toilets         1   0.08460 1.06356 -1064.39 14.6057  0.000186 ***
garage          1   0.14304 1.12199 -1052.73 24.6935 1.636e-06 ***
elevator        1   0.14529 1.12424 -1052.29 25.0826 1.373e-06 ***
streetcategory  3   0.03040 1.00935 -1079.79  1.7492  0.158880
heating         3   0.04007 1.01902 -1077.71  2.3060  0.078550 .
storage         1   0.03280 1.01175 -1075.27  5.6625  0.018445 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + floor + rooms + out +
    conservation + toilets + garage + elevator + streetcategory +
    heating + storage
               Df Sum of Sq     RSS      AIC F value    Pr(>F)
<none>                      0.97956 -1082.32
area            1   0.46618 1.44574  -999.46 80.9051 4.535e-16 ***
zone           22   1.26736 2.24692  -945.34  9.9977 < 2.2e-16 ***
category        6   0.10075 1.08031 -1072.98  2.9142 0.0099017 **
floor           1   0.00202 0.98158 -1083.87  0.3509 0.5543866
rooms           1   0.00422 0.98378 -1083.39  0.7322 0.3933744
out             3   0.06702 1.04658 -1073.90  3.8772 0.0102869 *
conservation    3   0.01128 0.99084 -1085.83  0.6525 0.5824434
toilets         1   0.08451 1.06406 -1066.28 14.6657 0.0001803 ***
garage          1   0.14340 1.12296 -1054.54 24.8869 1.492e-06 ***
elevator        1   0.14688 1.12643 -1053.87 25.4902 1.136e-06 ***
streetcategory  3   0.03151 1.01107 -1081.42  1.8229 0.1448410
heating         3   0.04185 1.02140 -1079.20  2.4207 0.0678034 .
storage         1   0.03220 1.01175 -1077.27  5.5875 0.0192183 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + floor + rooms + out +
    toilets + garage + elevator + streetcategory + heating + storage
               Df Sum of Sq     RSS     AIC F value    Pr(>F)
<none>                      0.99084 -1085.8
area            1   0.46386 1.45470 -1004.1 80.9902 3.978e-16 ***
zone           22   1.27513 2.26596  -949.5 10.1199 < 2.2e-16 ***
category        6   0.12795 1.11878 -1071.3  3.7233 0.0016539 **
floor           1   0.00212 0.99295 -1087.4  0.3700 0.5437901
rooms           1   0.00384 0.99468 -1087.0  0.6706 0.4139769
out             3   0.07314 1.06397 -1076.3  4.2565 0.0062592 **
toilets         1   0.08930 1.08013 -1069.0 15.5913 0.0001142 ***
garage          1   0.15229 1.14313 -1056.7 26.5903 6.824e-07 ***
elevator        1   0.14955 1.14039 -1057.2 26.1122 8.459e-07 ***
streetcategory  3   0.03302 1.02385 -1084.7  1.9216 0.1278696
heating         3   0.03941 1.03025 -1083.3  2.2939 0.0796834 .
storage         1   0.03105 1.02189 -1081.1  5.4221 0.0210400 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + rooms + out + toilets +
    garage + elevator + streetcategory + heating + storage
               Df Sum of Sq     RSS      AIC F value    Pr(>F)
<none>                      0.99295 -1087.36
area            1   0.46227 1.45522 -1006.04 81.0051 3.828e-16 ***
zone           22   1.32477 2.31772  -946.57 10.5521 < 2.2e-16 ***
category        6   0.13186 1.12482 -1072.18  3.8512 0.0012404 **
rooms           1   0.00390 0.99685 -1088.51  0.6826 0.4098210
out             3   0.07320 1.06615 -1077.86  4.2757 0.0060983 **
toilets         1   0.08971 1.08266 -1070.51 15.7202 0.0001071 ***
garage          1   0.15125 1.14420 -1058.45 26.5036 7.058e-07 ***
elevator        1   0.15179 1.14474 -1058.35 26.5984 6.764e-07 ***
streetcategory  3   0.03140 1.02435 -1086.57  1.8340 0.1427479
heating         3   0.04104 1.03400 -1084.53  2.3975 0.0697654 .
storage         1   0.02921 1.02217 -1083.04  5.1190 0.0249012 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + heating + storage
               Df Sum of Sq     RSS      AIC  F value    Pr(>F)
<none>                      0.99685 -1088.51
area            1   0.63037 1.62722  -983.68 110.6630 < 2.2e-16 ***
zone           22   1.32105 2.31790  -948.56  10.5415 < 2.2e-16 ***
category        6   0.12882 1.12567 -1074.01   3.7690  0.001487 **
out             3   0.07696 1.07381 -1078.30   4.5036  0.004526 **
toilets         1   0.09245 1.08930 -1071.17  16.2297 8.353e-05 ***
garage          1   0.15205 1.14890 -1059.56  26.6921 6.452e-07 ***
elevator        1   0.15775 1.15460 -1058.48  27.6935 4.121e-07 ***
streetcategory  3   0.03042 1.02727 -1087.95   1.7802  0.152691
heating         3   0.04262 1.03947 -1085.38   2.4940  0.061618 .
storage         1   0.02789 1.02474 -1084.49   4.8962  0.028210 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> model.be drop1(model.be, test = "F")
Single term deletions

Model:
logtotalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + storage
               Df Sum of Sq    RSS      AIC  F value    Pr(>F)
<none>                      1.0395 -1085.38
area            1   0.67195 1.7114  -978.68 115.0658 < 2.2e-16 ***
zone           22   1.29885 2.3383  -952.64  10.1099 < 2.2e-16 ***
category        6   0.15896 1.1984 -1066.36   4.5369 0.0002631 ***
out             3   0.07346 1.1129 -1076.50   4.1930 0.0067656 **
toilets         1   0.09746 1.1369 -1067.84  16.6896 6.648e-05 ***
garage          1   0.14218 1.1817 -1059.43  24.3479 1.838e-06 ***
elevator        1   0.19646 1.2359 -1049.64  33.6422 2.967e-08 ***
streetcategory  3   0.03627 1.0757 -1083.90   2.0703 0.1058248
storage         1   0.03132 1.0708 -1080.91   5.3633 0.0217043 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> formula(model.be)
logtotalprice ~ area + zone + category + out + toilets + garage +
    elevator + streetcategory + storage
> modelE modelEg cv.errorN CVNe CVNe
[1] 0.007113846
> set.seed(5)
> cv.error5 CV5e CV5e
[1] 0.007780027
For Model (E), CVn = 0.0071 and CV5 = 0.0078.
(ii)
> MGOF MGOF
           R2        R2.adj           AIC           BIC            SE
   0.91673861    0.89849594 -464.72399549 -325.95969791    0.07641801
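Cross-validation errors and goodness-of-fit measures of the kind reported here can be obtained along the following lines. This is a sketch under stated assumptions, not the code of the solution: it assumes cv.glm() from the boot package with its default squared-error cost, uses the built-in mtcars data in place of VIT2005, and the object names fit_glm, fit_lm, cv_n, cv_5, and gof are introduced only for the example.

```r
library(boot)   # cv.glm(); boot is a recommended package shipped with R

# Built-in mtcars data stand in for VIT2005; the formula is arbitrary
fit_glm <- glm(log(mpg) ~ wt + hp, data = mtcars)   # gaussian family by default
fit_lm  <- lm(log(mpg) ~ wt + hp, data = mtcars)    # same fit, lm interface

# Leave-one-out cross-validation error (K defaults to n)
cv_n <- cv.glm(mtcars, fit_glm)$delta[1]

# 5-fold cross-validation error; the folds are random, so fix the seed
set.seed(5)
cv_5 <- cv.glm(mtcars, fit_glm, K = 5)$delta[1]

# Goodness-of-fit measures from the lm fit
s <- summary(fit_lm)
gof <- c(R2 = s$r.squared, R2.adj = s$adj.r.squared,
         AIC = AIC(fit_lm), BIC = BIC(fit_lm), SE = s$sigma)

c(CVn = cv_n, CV5 = cv_5)
gof
```

Note that AIC() and BIC() include the constant terms of the log-likelihood, so their values differ from the AIC column printed by drop1() and add1(), which is based on extractAIC().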
The R², Ra², AIC, BIC, and standard error for modelE are 0.9167, 0.8985, -464.724, -325.9597, and 0.0764, respectively. The total proportion of variability explained by modelE is 0.9167.
Model (F)
The function stepAIC() from the MASS package is used to find a model using the AIC criterion.
> SCOPE mod.fs modelF formula(modelF)
logtotalprice ~ area + zone + elevator + toilets + garage + category +
    out + storage + heating + streetcategory
The AIC criterion suggests a model with variables area, zone, elevator, toilets, garage, category, out, storage, heating, and streetcategory.
(i) > > > >
modelFg >
set.seed(5) cv.error5 SCOPE mod.fs modelG formula(modelG) logtotalprice ~ area + elevator + garage + zone + toilets + storage The BIC criterion suggests a model with variables area, elevator, garage, zone, toilets, and storage. (i) > > > >
modelGg >
set.seed(5) cv.error5 SCOPE mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ 1
               Df Sum of Sq     RSS     AIC  F value    Pr(>F)
<none>                      12.4844 -621.48
area            1    7.8362  4.6482 -834.87 364.1448 < 2.2e-16 ***
zone           22    7.5626  4.9218 -780.40  13.6196 < 2.2e-16 ***
category        6    4.8773  7.6071 -717.48  22.5468 < 2.2e-16 ***
age             1    1.2432 11.2412 -642.35  23.8875 1.990e-06 ***
floor           1    0.0307 12.4537 -620.02   0.5324    0.4664
rooms           1    3.3932  9.0912 -688.63  80.6198 < 2.2e-16 ***
out             3    0.1954 12.2890 -618.92   1.1344    0.3361
conservation    3    1.3130 11.1715 -639.71   8.3836 2.705e-05 ***
toilets         1    6.1982  6.2862 -769.06 212.9786 < 2.2e-16 ***
garage          1    3.3694  9.1150 -688.06  79.8464 < 2.2e-16 ***
elevator        1    4.0709  8.4135 -705.52 104.5141 < 2.2e-16 ***
streetcategory  3    1.3194 11.1650 -639.83   8.4299 2.548e-05 ***
heating         3    2.0692 10.4152 -654.99  14.1722 1.850e-08 ***
storage         1    1.0175 11.4669 -638.02  19.1668 1.866e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area
               Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>                      4.6482 -834.87
zone           22   2.38890 2.2593 -948.14  9.3240 < 2.2e-16 ***
category        6   1.39298 3.2552 -900.52 14.9772 3.035e-14 ***
age             1   0.94091 3.7073 -882.17 54.5666 3.274e-12 ***
floor           1   0.00287 4.6453 -833.00  0.1327 0.7159579
rooms           1   0.00522 4.6430 -833.11  0.2418 0.6234242
out             3   0.06955 4.5787 -832.15  1.0785 0.3591391
conservation    3   0.56107 4.0871 -856.91  9.7466 4.708e-06 ***
toilets         1   0.89664 3.7516 -879.59 51.3859 1.200e-11 ***
garage          1   0.82734 3.8209 -875.60 46.5544 8.921e-11 ***
elevator        1   0.98718 3.6610 -884.91 57.9741 8.291e-13 ***
streetcategory  3   0.01528 4.6329 -829.58  0.2342 0.8724910
heating         3   0.41740 4.2308 -849.38  7.0048 0.0001628 ***
storage         1   0.35116 4.2970 -849.99 17.5699 4.046e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone
               Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>                      2.2593 -948.14
category        6   0.49734 1.7620 -990.34  8.8444 1.685e-08 ***
age             1   0.15632 2.1030 -961.77 14.3460 0.0002031 ***
floor           1   0.00580 2.2535 -946.70  0.4964 0.4819483
rooms           1   0.02241 2.2369 -948.31  1.9331 0.1660184
out             3   0.02187 2.2374 -944.26  0.6223 0.6014037
conservation    3   0.19686 2.0624 -962.01  6.0770 0.0005689 ***
toilets         1   0.40710 1.8522 -989.45 42.4204 6.230e-10 ***
garage          1   0.36212 1.8972 -984.22 36.8384 6.671e-09 ***
elevator        1   0.42018 1.8391 -991.00 44.0942 3.098e-10 ***
streetcategory  3   0.06860 2.1907 -948.86  1.9937 0.1163357
heating         3   0.17104 2.0883 -959.30  5.2147 0.0017533 **
storage         1   0.11314 2.1462 -957.34 10.1740 0.0016624 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets
               Df Sum of Sq    RSS      AIC F value    Pr(>F)
<none>                      1.8522  -989.45
category        6   0.26912 1.5831 -1011.68  5.2984 4.544e-05 ***
age             1   0.10365 1.7486 -1000.00 11.3811 0.0008975 ***
floor           1   0.00599 1.8462  -988.16  0.6226 0.4310439
rooms           1   0.01137 1.8408  -988.79  1.1855 0.2776021
out             3   0.04104 1.8112  -988.33  1.4350 0.2339420
conservation    3   0.10036 1.7518  -995.59  3.6282 0.0140205 *
garage          1   0.21264 1.6396 -1014.03 24.9012 1.348e-06 ***
elevator        1   0.32835 1.5238 -1029.99 41.3709 9.772e-10 ***
streetcategory  3   0.05361 1.7986  -989.85  1.8878 0.1330509
heating         3   0.12343 1.7288  -998.49  4.5220 0.0043477 **
storage         1   0.08179 1.7704  -997.30  8.8701 0.0032727 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator
               Df Sum of Sq    RSS     AIC F value    Pr(>F)
<none>                      1.5238 -1030.0
category        6  0.178990 1.3449 -1045.2  4.1258 0.0006487 ***
age             1  0.048398 1.4754 -1035.0  6.2652 0.0131521 *
floor           1  0.001602 1.5223 -1028.2  0.2009 0.6544680
rooms           1  0.001585 1.5223 -1028.2  0.1989 0.6561194
out             3  0.037545 1.4863 -1029.4  1.5914 0.1928886
conservation    3  0.081656 1.4422 -1036.0  3.5670 0.0151983 *
garage          1  0.178658 1.3452 -1055.2 25.3672 1.093e-06 ***
streetcategory  3  0.062161 1.4617 -1033.1  2.6792 0.0482952 *
heating         3  0.092733 1.4311 -1037.7  4.0822 0.0077429 **
storage         1  0.070407 1.4534 -1038.3  9.2524 0.0026825 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator + garage
               Df Sum of Sq    RSS     AIC F value   Pr(>F)
<none>                      1.3452 -1055.2
category        6  0.142650 1.2025 -1067.6  3.6576 0.001865 **
age             1  0.036534 1.3087 -1059.2  5.3042 0.022357 *
floor           1  0.002271 1.3429 -1053.5  0.3213 0.571477
rooms           1  0.002035 1.3432 -1053.5  0.2879 0.592210
out             3  0.045758 1.2994 -1056.7  2.2067 0.088738 .
conservation    3  0.061414 1.2838 -1059.4  2.9979 0.031950 *
streetcategory  3  0.058889 1.2863 -1058.9  2.8690 0.037776 *
heating         3  0.094803 1.2504 -1065.1  4.7513 0.003227 **
storage         1  0.060346 1.2849 -1063.2  8.9239 0.003186 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator + garage + category
               Df Sum of Sq    RSS     AIC F value   Pr(>F)
<none>                      1.2025 -1067.6
age             1  0.006361 1.1962 -1066.8  0.9784 0.323890
floor           1  0.000015 1.2025 -1065.6  0.0024 0.961383
rooms           1  0.006518 1.1960 -1066.8  1.0028 0.317944
out             3  0.078212 1.1243 -1076.3  4.2202 0.006504 **
conservation    3  0.015357 1.1872 -1064.4  0.7847 0.503852
streetcategory  3  0.055147 1.1474 -1071.8  2.9158 0.035638 *
heating         3  0.053972 1.1486 -1071.6  2.8508 0.038772 *
storage         1  0.052809 1.1497 -1075.4  8.4514 0.004096 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> mod.fs add1(mod.fs, scope = SCOPE, test = "F")
Single term additions

Model:
logtotalprice ~ area + zone + toilets + elevator + garage + category + storage
               Df Sum of Sq    RSS     AIC F value   Pr(>F)
<none>                      1.1497 -1075.4
age             1  0.001743 1.1480 -1073.7  0.2779 0.598711
floor           1  0.002001 1.1477 -1073.8  0.3190 0.572911
rooms           1  0.008907 1.1408 -1075.1  1.4288 0.233502
out             3  0.073994 1.0757 -1083.9  4.1500 0.007136 **
conservation    3  0.015323 1.1344 -1072.3  0.8149 0.487130
streetcategory  3  0.036807 1.1129 -1076.5  1.9953 0.116308
heating         3  0.045544 1.1042 -1078.2  2.4886 0.061928 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> formula(mod.fs)
logtotalprice ~ area + zone + toilets + elevator + garage + category + storage
Forward selection selects the variables area, zone, toilets, elevator, garage, category, and storage.
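The repeated add1() calls above can be automated. The following is a minimal sketch, not the code used in this solution: it runs on the built-in mtcars data rather than the apartment data, the function name forward_select is hypothetical, and the rule "add the term with the lowest AIC, stop when no term lowers the AIC" is an assumption about how each step is chosen.

```r
# Forward selection with add1(), illustrated on the built-in mtcars data.
# Assumption: at each step the candidate with the lowest AIC is added, and the
# search stops when no candidate improves on the current model's AIC.
forward_select <- function(null_model, candidates) {
  model <- null_model
  while (length(candidates) > 0) {
    tab <- add1(model, scope = candidates, test = "F")
    best <- which.min(tab$AIC[-1]) + 1       # skip the <none> row
    if (tab$AIC[best] >= tab$AIC[1]) break   # no candidate lowers the AIC
    term <- rownames(tab)[best]
    model <- update(model, as.formula(paste(". ~ . +", term)))
    candidates <- setdiff(candidates, term)  # do not offer the term again
  }
  model
}

null_model <- lm(mpg ~ 1, data = mtcars)
fs_model <- forward_select(null_model, c("wt", "hp", "qsec", "disp", "drat"))
formula(fs_model)
```

Selecting on the F test p-value instead of the AIC would only change the two lines that compute best and the stopping condition.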
(i) > > > > >
modelH
set.seed(5) cv.error5 CVNS CVNS
       CVNe        CVNf        CVNg        CVNh
0.007113846 0.007080105 0.007796288 0.007481615
> which.min(CVNS)
CVNf
   2
> CV5S CV5S
       CV5e        CV5f        CV5g        CV5h
0.007780027 0.007878634 0.008180095 0.008046479
> which.min(CV5S)
CV5e
   1
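The vectors CVNS and CV5S compare leave-one-out and 5-fold cross-validation errors across candidate models. A minimal sketch of how such errors can be computed by hand is shown below. It uses the built-in mtcars data and three illustrative formulas; the helper names cv_n and cv_k are hypothetical and are not the functions used in this solution.

```r
# Leave-one-out and 5-fold cross-validation error for linear models (mtcars).
cv_n <- function(model) {
  # Least-squares shortcut: the deleted residual is e_i / (1 - h_ii)
  mean((resid(model) / (1 - hatvalues(model)))^2)
}

cv_k <- function(formula, data, k = 5) {
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))  # random fold labels
  response <- all.vars(formula)[1]
  sq_err <- numeric(nrow(data))
  for (j in seq_len(k)) {
    test <- folds == j
    fit <- lm(formula, data = data[!test, ])                 # fit without fold j
    pred <- predict(fit, newdata = data[test, ])
    sq_err[test] <- (data[[response]][test] - pred)^2
  }
  mean(sq_err)
}

forms <- list(A = mpg ~ wt, B = mpg ~ wt + hp, C = mpg ~ wt + hp + qsec)
CVN <- sapply(forms, function(f) cv_n(lm(f, data = mtcars)))
set.seed(5)
CV5 <- sapply(forms, function(f) cv_k(f, mtcars, k = 5))
which.min(CVN)
which.min(CV5)
```

The 5-fold error depends on the random fold assignment, which is why a seed is set before it is computed; the leave-one-out error does not.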
Model (F) has the smallest CV_n error (0.0071). Its CV_5 error (0.0079) is the second smallest, slightly above that of Model (E) (0.0078).
(g)
> residualPlots(modelF)
               Test stat Pr(>|t|)
area              -1.621    0.107
zone                  NA       NA
elevator          -0.393    0.695
toilets            1.785    0.076
garage             0.426    0.671
category              NA       NA
out                   NA       NA
storage            0.879    0.380
heating               NA       NA
streetcategory        NA       NA
Tukey test         0.495    0.621

[Figure: Pearson residuals plotted against area, zone, elevator, toilets, garage, category, out, storage, heating, streetcategory, and the fitted values.]
The assumptions with respect to the residuals seem to be satisfied for Model (F).
Model (I)
Model (F) is assigned to the object modelI.
> modelI influenceIndexPlot(modelI, id.n = 3)
> outlierTest(modelI)
   rstudent unadjusted p-value Bonferonni p
93 4.250659         3.4698e-05    0.0075643

[Figure: Diagnostic Plots. Index plots of Cook's distance, Studentized residuals, Bonferroni p-value, and hat-values for modelI.]
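The Bonferroni p-value reported by outlierTest() can be reproduced with base R. The sketch below is illustrative only: it uses a small model on the built-in mtcars data, and it assumes the usual definition of the test (two-sided t test on the largest absolute Studentized residual, multiplied by the number of observations).

```r
# Bonferroni outlier test by hand for an lm fit (illustrated on mtcars).
fit <- lm(mpg ~ wt + hp, data = mtcars)
t_i <- rstudent(fit)                    # externally Studentized residuals
n <- length(t_i)
df <- df.residual(fit) - 1              # one observation is deleted
i_max <- which.max(abs(t_i))            # most extreme observation
p_unadj <- 2 * pt(abs(t_i[i_max]), df = df, lower.tail = FALSE)
p_bonf <- min(1, n * p_unadj)           # adjust for testing all n residuals
c(rstudent = unname(t_i[i_max]),
  unadjusted = unname(p_unadj),
  Bonferroni = unname(p_bonf))
```

Under this definition the ratio of the two p-values equals the number of observations. For observation 93 above, 0.0075643 / 3.4698e-05 is approximately 218.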
Observation 93 is an outlier according to the Bonferroni outlier test. (ii)
> influencePlot(modelI, id.n = 3)
        StudRes       Hat      CookD
3   -3.23941807 0.1243814 0.18133476
9   -0.79653792 0.5268393 0.12831016
31  -1.86226419 0.3717995 0.21695582
92   0.02263487 0.6218219 0.00443887
93   4.25065863 0.2457917 0.35322493
156 -3.09043662 0.1530896 0.19565031
160  2.11225702 0.3116083 0.21460798
185  0.74383798 0.4394763 0.10057032

[Figure: Studentized Residuals versus Hat-Values for modelI, with observations 3, 9, 31, 92, 93, 156, 160, and 185 labeled.]
Observation 93 has the largest Cook’s distance value and is an influential observation.
(iii) Observations 3 and 93 are removed from Model (I).
> modelI checking.plots(modelI)
> shapiro.test(resid(modelI))

        Shapiro-Wilk normality test

data:  resid(modelI)
W = 0.99441, p-value = 0.6049
[Figure: checking.plots(modelI) output with four panels: Standardized residuals versus ordered values for modelI; Normal Q-Q plot of standardized residuals from modelI; Standardized residuals versus fitted values for modelI; Density plot of standardized residuals for modelI (N = 216, Bandwidth = 0.3069).]
The normality assumptions for the residuals appear to be satisfied.
(v)
> vif(modelI)
                     GVIF Df GVIF^(1/(2*Df))
area             3.007687  1        1.734268
zone           178.394245 22        1.125039
elevator         2.278083  1        1.509332
toilets          3.438421  1        1.854298
garage           1.829758  1        1.352686
category        13.719834  6        1.243882
out              2.384567  3        1.155850
storage          1.463063  1        1.209571
heating          5.433721  3        1.325917
streetcategory   7.124974  3        1.387173
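For a predictor with one degree of freedom, the GVIF column is the ordinary variance inflation factor, and GVIF^(1/(2*Df)) is its square root. The sketch below shows how that quantity arises, using only numeric predictors from the built-in mtcars data (an illustrative stand-in for the apartment data; it does not cover the multi-degree-of-freedom factors such as zone).

```r
# Variance inflation factors by hand: VIF_j = 1 / (1 - R_j^2), where R_j^2 is
# obtained by regressing predictor j on the remaining predictors.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
X <- model.matrix(fit)[, -1]               # drop the intercept column
vif_by_hand <- sapply(colnames(X), function(v) {
  r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
  1 / (1 - r2)
})
vif_by_hand          # comparable to the GVIF column when Df = 1
sqrt(vif_by_hand)    # comparable to GVIF^(1/(2*Df)) when Df = 1
```

The square root is the factor by which the standard error of a coefficient is inflated relative to the case of uncorrelated predictors, which makes it comparable across terms with different degrees of freedom.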
Multicollinearity is not a problem with Model (I). The decrease in precision of estimation due to multicollinearity is less than 1.855 for all variables.
(vi) The coefficients for Model (I) and 95% confidence intervals for the parameters of Model (I) are computed in R Code (12.4).
R Code 12.4
> coef(summary(modelI))
                      Estimate   Std. Error      t value      Pr(>|t|)
(Intercept)      11.7524914839 0.0778168104 151.02766898 1.300977e-185
area              0.0041496205 0.0003977202  10.43351652  4.449800e-20
zoneZ21           0.3802200972 0.0435551810   8.72961812  2.102581e-15
zoneZ31           0.3389852061 0.0450382345   7.52660955  2.739747e-12
zoneZ32           0.2403972342 0.0404942389   5.93657865  1.553304e-08
zoneZ34           0.1681132305 0.0503495797   3.33892024  1.029979e-03
zoneZ35           0.3393209999 0.0483376686   7.01980484  4.848350e-11
zoneZ36           0.2217815708 0.0409454210   5.41651704  2.010000e-07
zoneZ37           0.3585080961 0.0420635177   8.52301747  7.429520e-15
zoneZ38           0.1927022820 0.0510651531   3.77365523  2.207769e-04
zoneZ41           0.2707262908 0.0392701711   6.89394223  9.740508e-11
zoneZ42           0.3337865477 0.0499736267   6.67925403  3.152129e-10
zoneZ43           0.2292942872 0.0507979499   4.51384923  1.173102e-05
zoneZ44           0.1956903684 0.0487539533   4.01383591  8.884359e-05
zoneZ45           0.1490659068 0.0418746595   3.55981179  4.792959e-04
zoneZ46           0.1346932202 0.0441661971   3.04969024  2.651214e-03
zoneZ47           0.0632316000 0.0461406978   1.37040840  1.723350e-01
zoneZ48           0.2299657516 0.0454032057   5.06496729  1.039851e-06
zoneZ49           0.2015766920 0.0480320286   4.19671411  4.322100e-05
zoneZ52           0.1346092984 0.0424127766   3.17379123  1.781088e-03
zoneZ53           0.1659316744 0.0419921817   3.95148972  1.129719e-04
zoneZ56           0.2017329937 0.0475626099   4.24141977  3.611259e-05
zoneZ61           0.1586852151 0.0397582510   3.99125241  9.695293e-05
zoneZ62           0.1298976193 0.0405609255   3.20253095  1.621587e-03
elevator          0.1035564143 0.0179882299   5.75689851  3.825401e-08
toilets           0.0858677405 0.0176860476   4.85511192  2.676492e-06
garage            0.0759192863 0.0142635424   5.32261088  3.140212e-07
category2B       -0.0117734068 0.0456010371  -0.25818287  7.965727e-01
category3A       -0.0661178646 0.0444057487  -1.48894831  1.383217e-01
category3B       -0.0870663644 0.0456240182  -1.90834494  5.800261e-02
category4A       -0.1186452274 0.0473737485  -2.50445091  1.318946e-02
category4B       -0.1492007232 0.0496295095  -3.00629051  3.038333e-03
category5A       -0.1992746969 0.0779315318  -2.55704838  1.141526e-02
outE25            0.1707890114 0.0451573944   3.78208295  2.139872e-04
outE50           -0.0002797115 0.0125073847  -0.02236371  9.821836e-01
outE75            0.0407193190 0.0323878425   1.25724086  2.103611e-01
storage           0.0376288633 0.0143838096   2.61605683  9.681683e-03
heating3A        -0.0040182917 0.0373211302  -0.10766801  9.143838e-01
heating3B        -0.0097999797 0.0476691945  -0.20558308  8.373583e-01
heating4A         0.0347673493 0.0393575474   0.88337185  3.782613e-01
streetcategoryS3  0.0209412769 0.0181382026   1.15453980  2.498712e-01
streetcategoryS4  0.0203906494 0.0196300252   1.03874800  3.003716e-01
streetcategoryS5 -0.0187018119 0.0324744025  -0.57589395  5.654353e-01
> confint(modelI)
                        2.5 %       97.5 %
(Intercept)      11.598898894 11.906084074
area              0.003364612  0.004934629
zoneZ21           0.294252129  0.466188065
zoneZ31           0.250090030  0.427880382
zoneZ32           0.160470866  0.320323602
zoneZ34           0.068734673  0.267491788
zoneZ35           0.243913495  0.434728505
zoneZ36           0.140964672  0.302598469
zoneZ37           0.275484331  0.441531862
zoneZ38           0.091911346  0.293493218
zoneZ41           0.193215953  0.348236629
zoneZ42           0.235150036  0.432423060
zoneZ43           0.129030750  0.329557825
zoneZ44           0.099461213  0.291919524
zoneZ45           0.066414904  0.231716910
zoneZ46           0.047519246  0.221867194
zoneZ47          -0.027839587  0.154302787
zoneZ48           0.140350206  0.319581298
zoneZ49           0.106772451  0.296380933
zoneZ52           0.050896176  0.218322421
zoneZ53           0.083048710  0.248814639
zoneZ56           0.107855278  0.295610710
zoneZ61           0.080211519  0.237158911
zoneZ62           0.049839627  0.209955611
elevator          0.068051762  0.139061067
toilets           0.050959527  0.120775954
garage            0.047766315  0.104072258
category2B       -0.101779427  0.078232613
category3A       -0.153764659  0.021528929
category3B       -0.177117744  0.002985015
category4A       -0.212150174 -0.025140281
category4B       -0.247158027 -0.051243420
category5A       -0.353093721 -0.045455673
outE25            0.081658641  0.259919382
outE50           -0.024966429  0.024407006
outE75           -0.023206876  0.104645514
storage           0.009238512  0.066019214
heating3A        -0.077681669  0.069645085
heating3B        -0.103888069  0.084288110
heating4A        -0.042915450  0.112450148
streetcategoryS3 -0.014859388  0.056741941
streetcategoryS4 -0.018354532  0.059135831
streetcategoryS5 -0.082798857  0.045395233
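Each row of confint(modelI) is the estimate plus or minus a t quantile times the standard error. The sketch below reproduces the row for area from the values printed by coef(summary(modelI)). The residual degrees of freedom are an assumption inferred from the output: 216 observations remain after removing two, and the model has 43 coefficients.

```r
# Reproducing one row of confint(modelI) from coef(summary(modelI)).
# Assumption: 216 observations and 43 coefficients, so 173 residual df.
est <- 0.0041496205      # Estimate for area
se  <- 0.0003977202      # Std. Error for area
df  <- 216 - 43
ci_area <- est + c(-1, 1) * qt(0.975, df) * se
ci_area
```

The result can be compared with the reported interval for area, (0.003364612, 0.004934629).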
(vii)
> RC RC
 [1] 63.43 19.42  3.51  2.88  1.33  1.04  0.68  0.41  0.35  0.12  6.83
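The code that builds RC is not shown above. One common way to obtain such percentages, offered here only as an assumption, is to express each sequential sum of squares from anova() as a share of the total. The sketch uses the built-in mtcars data, and the name RC_sketch is hypothetical.

```r
# Relative contribution of each term as a percentage of the total sum of squares.
# Assumption: sequential (Type I) sums of squares, so the order of the terms
# in the formula affects the result.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
tab <- anova(fit)
ss <- tab[["Sum Sq"]]                      # one entry per term, then Residuals
RC_sketch <- round(100 * ss / sum(ss), 2)
names(RC_sketch) <- rownames(tab)
RC_sketch
```

With this construction the last entry is the residual share, and the entries sum to 100 up to rounding, as the eleven values of RC above do.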
The relative contributions of area, zone, elevator, toilets, garage, category, out, storage, heating, and streetcategory to explaining the variability of log(totalprice) in Model (I), given in percentages, are 63.43, 19.42, 3.51, 2.88, 1.33, 1.04, 0.68, 0.41, 0.35, and 0.12, respectively.
(viii) The variable that explains the most variability in Model (I) is area (63.43%).
(ix) The variables area and zone together explain 82.85% of the variability in Model (I).
(x) The backtransformed predictions without bias correction for Model (I) are listed in the fit column of the data frame OWBC, and the corresponding lower and upper confidence limits are listed in the columns lwr and upr. The bias-corrected backtransformed predictions for Model (I) (Ypred) are listed in the Ytilde.pred column of the data frame OWBC, and the bias-corrected lower and upper confidence limits are listed in the columns l.inf and l.sup.
> + > > > + + > + + > + > >
Yhat.pred