English Pages 652 [678] Year 2006
Ralph B. D'Agostino, Sr.
Lisa M. Sullivan
Alexa S. Beiser
Introductory Applied Biostatistics
Digitized by the Internet Archive in
2012
http://www.archive.org/details/introductoryapplOOdago
Ralph D'Agostino Sr. is Professor of Mathematics, Statistics, and Public Health at Boston University. A respected and widely published statistician, he has more than 30 years of experience in running clinical trials and epidemiological research. He is a Senior Editor of STATISTICS IN MEDICINE, Fellow of the American Statistical Association and Epidemiologic Section of the American Heart Association. He was Chair (2003) of the American Statistical Association Section of Statistics in Epidemiology. He serves as Executive Director of Biometrics and Data Management for the Harvard Clinical Research Institute. His interests are in biostatistics and robust procedures, longitudinal data analysis, and multivariate data. Dr. D'Agostino's numerous awards include the Food & Drug Administration Commissioner's Special Citation in 1981 and 1995. He is Director of Data Management and Statistical Analysis for the Framingham Heart Study that for more than 50 years has searched for common factors that contribute to cardiovascular disease. He is co-author of five books in various fields of statistical methodology.

Lisa Sullivan is an Associate Professor of Biostatistics at the School of Public Health, Associate Professor of Mathematics and Statistics at Boston University, and Assistant Dean for Undergraduate Education in Public Health at Boston University, where she received her MA and her PhD. She has won numerous awards for excellence in teaching. Her research interests include applied biostatistics, longitudinal data analysis, design and analysis of clinical trials, and hierarchical modeling. Dr. Sullivan spends most of her time in the Boston University Statistics and Consulting Unit working on the Framingham Heart Study. Her recent research focuses on developing health risk appraisal functions to quantify individuals' risks of developing cardiovascular disease. Her dozens of articles are published in prestigious periodicals such as the JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, NEW ENGLAND JOURNAL OF MEDICINE, and STATISTICS IN MEDICINE. Away from work, she enjoys running and cooking.

Alexa Beiser is Professor of Biostatistics in the School of Public Health at Boston University. She received her MA from University of California at San Diego and her PhD from Boston University. Her research interests include clinical trials methodology, statistical computing, and survival analysis. Dr. Beiser joined the Framingham Study in 1994 after spending many years collaborating on pediatric research projects. She primarily investigates risk factors for stroke, dementia, and Alzheimer's disease using Framingham Study data. Her foremost methodological interest is in estimation of lifetime risk of disease. Dr. Beiser has published articles in the NEW ENGLAND JOURNAL OF MEDICINE, the JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, STATISTICS IN MEDICINE, STROKE, and NEUROLOGY. She enjoys reading, traveling, and spending time with her four children.

Introductory Applied Biostatistics
Ralph B. D'Agostino, Sr., Boston University
Lisa M. Sullivan, Boston University
Alexa S. Beiser, Boston University

THOMSON BROOKS/COLE
Australia • Canada • Mexico • Singapore • Spain • United Kingdom • United States

Introductory Applied Biostatistics
Ralph B. D'Agostino, Sr., Lisa M. Sullivan, Alexa S. Beiser

Editor: Carolyn Crockett
Assistant Editor: Ann Day
Editorial Assistant: Daniel Geller
Technology Project Manager: Burke Taft
Marketing Manager: Stacy Best
Marketing Assistant: Jessica Bothwell
Executive Marketing Communications Manager: Nathaniel Bergson-Michelson
Project Manager, Editorial Production: Kelsey McGee
Art Director: Lee Friedman
Print Buyer: Lisa Claudeanos
Permissions Editor: Sarah Harkrader
Production Service: Matrix Productions/Merrill Peterson
Text Designer: Carolyn Deacy
Copy Editor: Pamela Rockwell
Cover Designer: Simple Design/Denise Davidson
Cover Image: PhotoDisc® Getty Images™
Compositor: Interactive Composition Corporation
Cover Printer: Phoenix Color Corp
Printing and Binding: R.R. Donnelley/Crawfordsville

© 2006 Duxbury, an imprint of Thomson Brooks/Cole, a part of The Thomson Corporation. Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, Web distribution, information storage and retrieval systems, or in any other manner) without the written permission of the publisher.

Printed in the United States of America
3 4 5 6 7 09 08 07 06

For more information about our products, contact us at: Thomson Learning Academic Resource Center, 1-800-423-0563

For permission to use material from this text or product, submit a request online at http://www.thomsonrights.com. Any additional questions about permissions can be submitted by email to [email protected].

Library of Congress Control Number: 2004114720
ISBN 0-534-42399-X

Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA

Asia (including India)
Thomson Learning
5 Shenton Way #01-01
UIC Building
Singapore 068808

Australia/New Zealand
Thomson Learning Australia
102 Dodds Street
Southbank, Victoria 3006
Australia

Canada
Thomson Nelson
1120 Birchmount Road
Toronto, Ontario M1K 5G4
Canada

UK/Europe/Middle East/Africa
Thomson Learning
High Holborn House
50-51 Bedford Row
London WC1R 4LR
United Kingdom

Latin America
Thomson Learning
Seneca, 53
Colonia Polanco
11560 Mexico D.F.
Mexico

Spain (including Portugal)
Thomson Paraninfo
Calle Magallanes, 25
28015 Madrid, Spain

Brief Contents

CHAPTER 1   Motivation
CHAPTER 2   Summarizing Data
CHAPTER 3   Probability
CHAPTER 4   Sampling Distributions
CHAPTER 5   Statistical Inference: Procedures for $\mu$
CHAPTER 6   Statistical Inference: Procedures for $(\mu_1 - \mu_2)$
CHAPTER 7   Categorical Data
CHAPTER 8   Comparing Risks in Two Populations
CHAPTER 9   Analysis of Variance
CHAPTER 10  Correlation and Regression
CHAPTER 11  Logistic Regression Analysis
CHAPTER 12  Nonparametric Tests
CHAPTER 13  Introduction to Survival Analysis

Contents

Preface

CHAPTER 1  Motivation
1.1  Vocabulary
1.2  Population Parameters
1.3  Sampling and Sample Statistics
1.4  Statistical Inference

CHAPTER 2  Summarizing Data
2.1  Background
     2.1.1  Vocabulary
     2.1.2  Classification of Variables
     2.1.3  Notation
2.2  Descriptive Statistics and Graphical Methods
     2.2.1  Numerical Summaries for Continuous Variables
     2.2.2  Graphical Summaries for Continuous Variables
     2.2.3  Numerical Summaries for Discrete Variables
     2.2.4  Graphical Summaries for Discrete Variables
2.3  Key Formulas
     2.3.1  Graphical Methods: Continuous Variables
     2.3.2  Graphical Methods: Discrete Variables
2.4  Statistical Computing
     2.4.1  Continuous Variables
     2.4.2  Discrete Variables
     2.4.3  Summary of SAS Procedures
2.5  Analysis of Framingham Heart Study Data
2.6  Problems

CHAPTER 3  Probability
3.1  Background
     3.1.1  Vocabulary
3.2  First Principles
3.3  Combinations and Permutations
3.4  The Binomial Distribution
3.5  The Normal Distribution
     3.5.1  Percentiles of the Normal Distribution
     3.5.2  Normal Approximation to the Binomial
3.6  Key Formulas
3.7  Applications Using SAS
     3.7.1  Summary of SAS Functions
3.8  Analysis of Framingham Heart Study Data
3.9  Problems

CHAPTER 4  Sampling Distributions
4.1  Background
4.2  The Central Limit Theorem
4.3  Key Formulas
4.4  Applications Using SAS
     4.4.1  Summary of SAS Functions
4.5  Problems

CHAPTER 5  Statistical Inference: Procedures for $\mu$
5.1  Estimating $\mu$
     5.1.1  Vocabulary and Notation
     5.1.2  Confidence Intervals for $\mu$
     5.1.3  Precision and Sample Size Determination

50 years of age with diagnosed coronary artery disease. Had the patients been sampled from the population of patients 50 years of age free of cardiovascular disease, these observed systolic blood pressures might be slightly higher than expected.
20
Chapter 2 Summarizing Data
If the sample included not 7 patients, but 700 instead, it would be impossible to summarize the sample with respect to systolic blood pressure by inspecting the values. In fact, the same would probably be true if the sample included 20 subjects. In most applications, it is necessary to use statistical techniques to summarize a sample. Several of these are described next. To simplify the computations, each data element (i.e., each observed systolic blood pressure) is represented by the variable X. Here X denotes systolic blood pressure, and the subscripts ($i = 1, 2, \ldots, 7$) denote the subject number in the sample:

$X_1 = 121 \quad X_2 = 110 \quad X_3 = 114 \quad X_4 = 100 \quad X_5 = 160 \quad X_6 = 130 \quad X_7 = 130$

It is generally of interest to summarize a continuous variable with respect to location. Location refers to the "center" of the data set and addresses the question, What is a typical systolic blood pressure? In the computations that follow, we can drop the subscripts, because the subject number (i.e., subscript) has no impact on them. To organize our calculations, we arrange the data elements in a column, as shown. Notice that the data elements are ordered from smallest to largest (this is not necessary but is sometimes convenient) and that the data element 130 is listed twice, as subjects 6 and 7 each have systolic blood pressures of 130.

    X
    100
    110
    114
    121
    130
    130
    160

The first descriptive statistic we consider is the sample mean, denoted $\bar{X}$ ("X bar"). The sample mean is one statistic that summarizes the average value of a sample; it gives a sense of what a typical value looks like. To compute the sample mean, we sum all of the observations and divide by the sample size. The sample mean of the systolic blood pressures is

$\bar{X} = (100 + 110 + 114 + 121 + 130 + 130 + 160)/7$

In mathematics, the symbol $\sum$ (uppercase "sigma") denotes summation. The sample mean of the systolic blood pressures can be represented as follows:

$\bar{X} = \sum X / 7$, where $\sum X = 100 + 110 + 114 + 121 + 130 + 130 + 160$

In general, the sample mean is denoted:

$$\bar{X} = \frac{\sum X_i}{n} = \frac{\sum X}{n} \qquad (2.1)$$

NOTE: In the final expression of the formula, the subscript $i$ is suppressed and the summation is understood to be over all subjects in the sample.

The mean systolic blood pressure is $\bar{X} = 865/7 = 123.6$. Reviewing the data elements, we see that some of the observed systolic blood pressures are above the mean of 123.6, and others are below the mean. The mean of 123.6 is interpreted as the average, or typical, systolic blood pressure in the sample. In journal articles and research reports, readers are generally not shown the actual data elements. Instead, summary statistics such as the sample size and sample mean are provided.

The sample mean is referred to as the balancing point, or pivot point, of the sample since the sum of the distances between observations below the mean and the sample mean is equal to the sum of the distances between observations above the mean and the sample mean (see dotplot in Figure 2.2). These distances or "deviations from the mean" are denoted $(X - \bar{X})$. The following table displays the data elements along with their respective deviations from the mean (i.e., distance from $\bar{X} = 123.6$):

    X       (X - X̄)
    100     -23.6
    110     -13.6
    114      -9.6
    121      -2.6
    130       6.4
    130       6.4
    160      36.4
    ---     -----
    865      -0.2*

* This sum is theoretically zero; the difference here is due to rounding.

[Figure 2.1 Sample Mean as Balancing Point: dotplot of the seven systolic blood pressures on an axis from 100 to 160, with $\bar{X} = 123.6$ marked.]
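
The balancing-point property can be checked numerically. The following is a minimal Python sketch (Python is not used in the text; the variable names are illustrative assumptions) that computes the sample mean for Example 2.1 and compares the total distance of observations below and above it.

```python
# Systolic blood pressures (mmHg) for the 7 subjects in Example 2.1,
# listed by subject number.
sbp = [121, 110, 114, 100, 160, 130, 130]

n = len(sbp)
mean = sum(sbp) / n  # 865 / 7

# Deviations from the mean, (X - X bar), one per subject.
deviations = [x - mean for x in sbp]

# Total distance of observations below and above the mean.
below = sum(mean - x for x in sbp if x < mean)
above = sum(x - mean for x in sbp if x > mean)

print(f"n = {n}, mean = {mean:.1f}")
print(f"sum of deviations = {sum(deviations):.10f}")
print(f"distance below = {below:.2f}, distance above = {above:.2f}")
```

Because the unrounded mean is used here, the deviations sum to zero up to floating-point error, rather than the -0.2 obtained when each deviation is taken from the rounded mean of 123.6.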

The sample mean measures the location, or central tendency, of the sample. Location is very important in interpreting sample data. However, two very different samples might produce the same sample mean. Consider a second sample of 7 subjects from the population of patients 50 years of age with diagnosed coronary artery disease. Again we measure systolic blood pressures, in millimeters of mercury (mmHg), on each subject. The sample data are

    120   121   122   124   125   126   127

The sample mean for this sample is $\bar{X} = (120 + 121 + 122 + 124 + 125 + 126 + 127)/7 = 865/7 = 123.6$. The sizes and the means are the same in the two samples, yet the samples are quite different. For a more complete understanding of the data, we also need a measure of the dispersion, or variability, in the sample. Measures of dispersion address whether the data elements are tightly clustered together or widely spread. Specifically, we are interested in whether the data elements are tightly clustered about the mean or whether the data elements are widely spread above and below the mean. The goal is to generate an estimate of the dispersion in the sample, in particular the dispersion of the data elements about the sample mean. The deviations from the mean sum to zero, since the negative deviations "cancel out" the positive deviations (Figure 2.2). Of real interest is the magnitude of these deviations. Several techniques can be employed to summarize the magnitude of the deviations from the mean. One method is the mean absolute deviation (MAD), which is simply the mean of the absolute values of the deviations from the mean:

$$\text{MAD} = \frac{\sum |X - \bar{X}|}{n}$$

This technique is not generally used for mathematical reasons that are beyond the scope of this book. The more popular statistic, which proves to be the most straightforward mathematically, is based on squared deviations from the mean and is called the sample variance, defined as:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \qquad (2.2)$$

NOTE: The denominator in the sample variance is $(n - 1)$, not $n$ as was the case with the sample mean.

The following table organizes data for the computation of the sample variance. It displays the data elements, deviations from the mean, and squared deviations from the mean, respectively.

    X       (X - X̄)     (X - X̄)²
    100     -23.6         556.96
    110     -13.6         184.96
    114      -9.6          92.16
    121      -2.6           6.76
    130       6.4          40.96
    130       6.4          40.96
    160      36.4       1,324.96
    ---     -----       --------
    865      -0.2*      2,247.72

* The sum is not exactly zero due to rounding.

For Example 2.1, the sample variance is

$$s^2 = \frac{2{,}247.72}{6} = 374.6$$

The sample variance is interpreted as the average squared deviation from the mean. Therefore, on average, systolic blood pressures in our sample are 374.6 units squared from the sample mean of 123.6. This information is important; however, in its present form it does not exactly achieve our original goal, which was to compute a measure of the typical deviation from the mean in the sample. Recall that we summed the square of each deviation from the mean since their sum was zero. Because of this step, the sample variance does not address our original objective directly. To return to our original units, we compute what is called the sample standard deviation, denoted $s$, defined as the square root of the sample variance.

$$s = \sqrt{s^2} \qquad (2.3)$$

The sample standard deviation of the systolic blood pressures is

$$s = \sqrt{374.6} = 19.4$$
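
As a check on the arithmetic, formulas (2.2) and (2.3) can be sketched in a few lines of Python. This is only an illustration under the assumption that plain Python lists hold the data; the function name `sample_variance` is hypothetical and not part of the text.

```python
import math

# Example 2.1 and the second sample of 7 subjects described in the text.
sample_1 = [100, 110, 114, 121, 130, 130, 160]
sample_2 = [120, 121, 122, 124, 125, 126, 127]


def sample_variance(values):
    """Definitional formula (2.2): squared deviations summed, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


s2_1 = sample_variance(sample_1)
s_1 = math.sqrt(s2_1)                        # formula (2.3)
s_2 = math.sqrt(sample_variance(sample_2))

print(f"sample 1: variance = {s2_1:.1f}, standard deviation = {s_1:.1f}")
print(f"sample 2: standard deviation = {s_2:.1f}")
```

The two samples share the same size and mean, so the difference in their standard deviations reflects only how tightly the observations cluster around that mean.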

After taking the square root, we have a statistic which can be interpreted as the typical deviation from the mean. In this sample, systolic blood pressures are about 19.4 units from the sample mean. It is often difficult to interpret the value of a standard deviation (e.g., is 19.4 large, small, or appropriate?). Standard deviation, however, is very useful for comparing samples. Recall the second sample of $n = 7$ subjects we selected from the same population. The second sample had the same size ($n = 7$) and the same mean ($\bar{X} = 123.6$); however, the standard deviation for the second sample is $s = 2.6$. The standard deviation in the second sample is much smaller because all of the observations are tightly clustered around the sample mean of 123.6. There is much more variability in the systolic blood pressures measured among patients with diagnosed coronary artery disease in the first sample (sample 1: $s = 19.4$) as compared to the second sample ($s = 2.6$).

An alternative formula is available for computing the sample variance that is mathematically equivalent to the formulation provided in (2.2). This alternative formula is called the computational formula for the sample variance (Eq. 2.4). The formula provided in (2.2) is called the definitional formula for the sample variance.

$$s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} \qquad (2.4)$$

where $\sum X^2$ = the sum of the squared observations, and $(\sum X)^2$ = the square of the sum of the observations

The computational formula (2.4) can be easier to work with than the definitional formula given in (2.2), because the components in the computational formula are in most cases easier to compute (i.e., $\sum X^2$ and $(\sum X)^2$). We will now illustrate the use of the computational formula for the sample variance using data from Example 2.1. The following table displays each data element, along with each data element squared.

    X        X²
    100      10,000
    110      12,100
    114      12,996
    121      14,641
    130      16,900
    130      16,900
    160      25,600
    ---     -------
    865     109,137

Using the computational formula (2.4):

$$s^2 = \frac{109{,}137 - (865)^2/7}{7 - 1} = \frac{109{,}137 - 106{,}889.3}{6} = \frac{2247.7}{6} = 374.6$$
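
The equivalence of the two formulas can be illustrated with a short Python sketch. The function names below are hypothetical, chosen only for this illustration; the data are those of Example 2.1.

```python
sbp = [100, 110, 114, 121, 130, 130, 160]


def variance_definitional(values):
    """Formula (2.2): based on deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_computational(values):
    """Formula (2.4): based only on the sum and the sum of squares."""
    n = len(values)
    sum_x = sum(values)                   # sum of the observations
    sum_x2 = sum(x * x for x in values)   # sum of the squared observations
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


sum_x = sum(sbp)
sum_x2 = sum(x * x for x in sbp)
s2_def = variance_definitional(sbp)
s2_comp = variance_computational(sbp)

print(sum_x, sum_x2)
print(f"{s2_def:.1f} {s2_comp:.1f}")
```

One caveat when moving from hand calculation to software: with floating-point data whose mean is very large relative to its spread, the subtraction in (2.4) can lose precision, which is why many statistical libraries prefer deviation-based algorithms.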

As noted, the computations can be somewhat easier with the computational formula as compared to the definitional formula. To implement the computational formula, we need only to compute the sum of the data elements and the sum of the squared data elements, as opposed to deviations and deviations squared for each data element. The reduced number of calculations with the computational formula reduces the chance of error.

A standard data summary for a continuous variable in a sample consists of three statistics:

    sample size ($n$)
    sample mean ($\bar{X}$)
    sample standard deviation ($s$)

These three statistics provide information on the number of subjects in the sample, the location, and the dispersion of the sample, respectively. We purposely chose a small data set to illustrate these statistics. It is easy to imagine applications with much larger sample sizes in which it would be impossible to view the entire sample. In such cases, the sample size, mean, and standard deviation provide a very informative and useful summary. Publications and reports almost always include these statistics. As a general guideline, descriptive statistics should include no more than one decimal place beyond that observed in the original data elements. For example, the systolic blood pressures are recorded as whole numbers. Therefore, descriptive statistics are presented to the nearest tenths place (i.e., one decimal place).

The standard summary for Example 2.1 is $n = 7$, $\bar{X} = 123.6$, and $s = 19.4$.

A number of other descriptive statistics beyond what we have called the standard summary statistics (i.e., $n$, $\bar{X}$, and $s$) are also widely used for continuous variables, including the median and quartiles.

The sample median is defined as the middle value. It is the value that has as many values above as below it. The median is computed by arranging the data elements from smallest to largest and successively counting from the right and left to arrive at the median, or middle, value. For example, if we arrange our 7 data elements from smallest to largest and count in from the right and left simultaneously, we arrive at the median or middle value after three steps (values crossed out at each step are shown in parentheses):

    Step 1:   (100)   110    114    121    130    130   (160)
    Step 2:   (100)  (110)   114    121    130   (130)  (160)
    Step 3:   (100)  (110)  (114)   121   (130)  (130)  (160)
                                   Median

Because the number of observations in this sample is odd ($n = 7$), this procedure produces a single number, the median value, upon successively counting in from the right and left. In Example 2.2, we will illustrate the same procedure with an even number of data elements. The interpretation of the median in Example 2.1 is as follows: Half (50%) of the systolic blood pressures are greater than 121 and half (50%) are less than 121.

Both the mean and the median are statistics that measure the average or typical value of a particular characteristic. The median is particularly useful when there are extreme values (either very small or very large as compared to other values) in the sample. Suppose in Example 2.1 that the maximum value was not 160 but 260 instead. The sample mean would be $\bar{X} = (100 + 110 + 114 + 121 + 130 + 130 + 260)/7 = 137.9$, which does not look like a typical value (since 6 of 7 observations are below it). The sample mean is affected by extreme values. In this case, the value 260 inflates the mean value, and it is therefore no longer representative of a typical value. A better measure of location in this situation is the median, which is still 121 and is more representative of a typical systolic blood pressure in this sample. In the absence of extreme values, the sample mean and the sample median will be close in value and the sample mean is considered a better measure of location since all observations contribute to the sample mean. When the sample mean and the sample median are very different, it suggests that extreme values are affecting the mean and that the median might be a more appropriate measure of location. As we work through more examples, it will become clear which measure of location (the mean or the median) is more appropriate in specific applications.
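
The sensitivity of the mean, and the stability of the median, can be seen in a small Python sketch using the standard-library `statistics` module. The list names are illustrative assumptions; the data are Example 2.1 and the same sample with the maximum changed from 160 to 260.

```python
import statistics

original = [100, 110, 114, 121, 130, 130, 160]
with_extreme = [100, 110, 114, 121, 130, 130, 260]  # maximum changed to 260

mean_original = statistics.mean(original)
mean_extreme = statistics.mean(with_extreme)
median_original = statistics.median(original)
median_extreme = statistics.median(with_extreme)

print(f"original:     mean = {mean_original:.1f}, median = {median_original}")
print(f"with extreme: mean = {mean_extreme:.1f}, median = {median_extreme}")
```

Only one observation differs between the two lists, yet the mean shifts by more than 14 units while the median is unchanged.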

Because the sample size in Example 2.1 is small ($n = 7$), the method of successively counting into the middle of the ordered data set to locate the median is easy to implement. When the sample size is larger, a more efficient method for computing the median involves two steps. In the first step, we compute the position of the median in the ordered data set, and in the second step, we locate the median value. When the number of observations is odd, the position of the median is computed as follows:

$$\frac{n + 1}{2} \qquad (2.5)$$

For Example 2.1, the median is in the fourth position ($(7 + 1)/2 = 4$) in the ordered data set and is equal to 121. The median represents the middle value; to further describe the sample, we now analyze the top and bottom halves.

The first and third quartiles are the values that separate, respectively, the bottom and top 25% of the data elements. The first quartile of the sample, denoted $Q_1$, is the sample value that holds approximately 25% of the data elements at or below it and approximately 75% above or equal to it. The third quartile, denoted $Q_3$, holds approximately 25% of the data elements at or above it and approximately 75% below or equal to it. The median is also referred to as the second quartile, $Q_2$.

The best way to determine the quartiles is to follow the two-step procedure outlined for determining the median (i.e., first compute the positions of the quartiles in the ordered data set, then locate the values). When the number of observations in a sample is odd, the positions of the quartiles are determined by the following formula:

$$\left[\frac{n + 3}{4}\right] \qquad (2.6)$$

where $[k]$ is the greatest integer less than or equal to $k$; for example, $[2.1] = 2$, $[2.9] = 2$, $[5.0] = 5$, $[10.8] = 10$, and so on.

For Example 2.1,

$$\left[\frac{n + 3}{4}\right] = \left[\frac{7 + 3}{4}\right] = \left[\frac{10}{4}\right] = [2.5] = 2$$

In Example 2.1, the quartiles are in the second positions from the top and bottom of the ordered data set. The first quartile is $Q_1 = 110$ and the third quartile is $Q_3 = 130$. Approximately 25% of the systolic blood pressures are 110 or lower and approximately 25% of the systolic blood pressures are 130 or higher. Again, because the sample in Example 2.1 is so small ($n = 7$), we do not need all of the statistics described to summarize and interpret these data. In larger samples, the quartiles are very informative statistics for understanding the distribution of a particular characteristic.
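
The two-step positional procedure for an odd number of observations can be sketched in Python as follows. The function name `quartiles_odd_n` is a hypothetical helper introduced for illustration; it implements only the odd-$n$ rules (2.5) and (2.6) stated above and deliberately rejects even sample sizes, which this passage does not cover.

```python
def quartiles_odd_n(values):
    """Return (Q1, median, Q3) using positions (2.5) and (2.6); odd n only."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 0:
        raise ValueError("this sketch covers only an odd number of observations")
    median_pos = (n + 1) // 2        # (2.5): position of the median
    quartile_pos = (n + 3) // 4      # (2.6): greatest integer <= (n + 3) / 4
    q1 = ordered[quartile_pos - 1]   # counted in from the bottom
    q3 = ordered[n - quartile_pos]   # counted in from the top
    return q1, ordered[median_pos - 1], q3


sbp = [121, 110, 114, 100, 160, 130, 130]
q1, median, q3 = quartiles_odd_n(sbp)
print(q1, median, q3)
```

Statistical packages use several different quartile conventions, so values from other software may differ slightly from this positional rule in some samples.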

The mode of the data set is defined as the most frequent value. In Example 2.1, the mode is 130, since it appears twice and the remaining values appear only once. A sample can have one mode or several modes. A sample with no repeated values has no mode.

Other very informative descriptive statistics include the minimum and maximum values. In Example 2.1, the minimum is 100 and the maximum is 160. These values can be very useful, especially with regard to identifying outliers. Outliers are values that exceed the "normal" or expected range of values. For example, suppose ages are recorded on each of 20 individuals participating in an experimental study. Suppose the mean age for the sample is 83.5, with a standard deviation of 5.6. Suppose the minimum age is 70 and the highest five ages, in descending order, are 110, 90, 89, 89, and 87. Assuming that each age was recorded accurately, an age of 110 might be considered an outlier. It is not an incorrect value, just a value outside (in this case, above) the normal range. Outlying values can be determined by an expert in the particular substantive area or by using one of several statistical definitions (see Example 2.3). Assuming there are no errors in the data, the statistical analyst need not do anything in particular with respect to outliers, only be aware of their existence and their impact on certain descriptive statistics (e.g., the sample mean).

Another descriptive statistic that addresses dispersion in a data set is the range. The range is defined as the maximum value minus the minimum value. In Example 2.1 the range = 160 - 100, or 60. Some investigators report the range as "100 to 160"; others report the range as 60. Both reports are appropriate. As noted, the range addresses dispersion in the sample. In Example 2.1, the observed systolic blood pressures cover 60 units. The range is based on only two values in the sample, the maximum and minimum, and although it is a very useful statistic, it can be somewhat misleading, especially in the presence of outliers. For example, if the maximum value was 260 instead of 160 and all other data elements were unchanged, the range would be 260 - 100 = 160. This would suggest much more dispersion in the sample than the range of 60 (based on the data presented in Example 2.1) when only a single observation changed. We suggest that the range be interpreted with caution and that the standard deviation be used to address dispersion in a sample.

Consider the following samples, call them samples A, B, C, and D. The samples are all of the same size ($n = 11$), have the same means (50) and ranges (100), yet the standard deviations are different. How are the samples different?

    Sample                  A      B      C      D

    Raw Data                0      0      0      0
                           50     10     20      0
                           50     20     20      0
                           50     30     20      0
                           50     40     20      0
                           50     50     50     50
                           50     60     80    100
                           50     70     80    100
                           50     80     80    100
                           50     90     80    100
                          100    100    100    100

    Summary Statistics
    n                      11     11     11     11
    X̄                      50     50     50     50
    Range                 100    100    100    100
    s                      22     33     35     50

Based on the standard deviations, the first sample has the least variation among observations, and the last sample has the most. The range, in this example, does not differentiate the samples.
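
The comparison of samples A through D can be reproduced with a short Python sketch. The data values below are an assumption: they are the four samples as reconstructed from the table, with 0 as the smallest value in each sample so that every sample has $n = 11$, mean 50, and range 100. With these values the standard deviation of sample A works out to about 22.

```python
import statistics

# Assumed reconstruction of samples A-D (n = 11, mean 50, range 100 each).
samples = {
    "A": [0, 50, 50, 50, 50, 50, 50, 50, 50, 50, 100],
    "B": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    "C": [0, 20, 20, 20, 20, 50, 80, 80, 80, 80, 100],
    "D": [0, 0, 0, 0, 0, 50, 100, 100, 100, 100, 100],
}

summary = {}
for name, values in samples.items():
    summary[name] = {
        "n": len(values),
        "mean": statistics.mean(values),
        "range": max(values) - min(values),
        "s": statistics.stdev(values),  # sample standard deviation (n - 1)
    }
    row = summary[name]
    print(f"{name}: n={row['n']} mean={row['mean']} "
          f"range={row['range']} s={row['s']:.1f}")
```

The size, mean, and range are identical across the four samples; only the standard deviation separates them.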

SAS Example 2.1 Summary Statistics on Systolic Blood Pressures Using SAS

The following descriptive statistics were generated using SAS Proc Univariate (see Section 2.4 for more details) and the data in Example 2.1. An interpretation of the relevant components appears after the output.

SAS Output for Example 2.1

    Summary Statistics
    The UNIVARIATE Procedure
    Variable: sbp (systolic blood pressure)

    Moments
    N                         7      Sum Weights                7
    Mean             123.571429      Sum Observations         865
    Std Deviation    19.3550781      Variance          374.619048
    Skewness         1.04211435      Kurtosis          1.63467176
    Uncorrected SS       109137      Corrected SS      2247.71429
    Coeff Variation   15.663069      Std Error Mean    7.31553189

    Basic Statistical Measures
    Location                     Variability
    Mean      123.5714           Std Deviation          19.35508
    Median    121.0000           Variance              374.61905
    Mode      130.0000           Range                  60.00000
                                 Interquartile Range    20.00000

    Tests for Location: Mu0=0
    Test            -Statistic-        -----p Value-----
    Student's t     t   16.89165       Pr > |t|
    Sign            M                  Pr >= |M|    0.0156
    Signed Rank     S   14             Pr >= |S|    0.0156

    Quantiles (Definition 5)
    Quantile        Estimate
    100% Max        160
    99%             160
    95%             160
    90%             160
    75% Q3          130
    50% Median      121
    25% Q1          110
    10%             100
    5%              100
    1%              100
    0% Min          100

    Extreme Observations
    ----Lowest----        ----Highest---
    Value      Obs        Value      Obs
    100        4          114        3
    110        2          121        1
    114        3          130        6
    121        1          130        7
    130        7          160        5

Interpretation of SAS Output for SAS Example 2.1

The SAS Univariate Procedure is used to generate descriptive statistics on a continuous variable. SAS generates a number of descriptive statistics in the section labeled "Moments"; we will highlight only a few. The sample size is 7 (notice that SAS uses uppercase N as opposed to lowercase n to denote sample size), the sample mean is 123.6, and the sample standard deviation is 19.4. The sum of the observations (i.e., $\sum X$) is 865, and the sample variance, $s^2$, is 374.6. The skewness of a sample indicates the degree of asymmetry in the sample distribution. Values close to 0 are indicative of symmetry. The kurtosis of a sample indicates the thickness in the tails of the distribution (i.e., the degree of clustering of observations at the extremes). Again, values close to 0 indicate lack of clustering in the tails. Estimates of skewness and kurtosis are somewhat unreliable in small samples and should be interpreted with caution. The normal distribution (discussed extensively in Chapter 3) has skewness = 0 and kurtosis = 0. The uncorrected sum of squares, "Uncorrected SS," is the sum of the observations squared (i.e., $\sum X^2 = 109{,}137$, which we used in the computational formula for the sample variance). The corrected sum of squares, "Corrected SS," is the numerator of the sample variance, $\sum (X - \bar{X})^2 = 2247.7$. The coefficient of variation, "Coeff Variation," is defined as the ratio of the sample standard deviation to the sample mean, expressed as a percentage (i.e., $(s/\bar{X}) \times 100$). The standard error of the mean, "Std Error Mean," is defined as $s/\sqrt{n}$.

The next part of the SAS output summarizes the most popular summary statistics for continuous variables in the section entitled "Basic Statistical Measures." Several measures of location are provided (the mean, median, and mode) as are several measures of dispersion (standard deviation, range, and interquartile range). The range is 160 - 100, or 60; the interquartile range, the difference between the first and third quartiles ($Q_3 - Q_1$), is 20. The most appropriate measures of location and dispersion depend on whether there are outliers in the data set. If there are no outliers, the mean and standard deviation are the most appropriate measures of location and dispersion, respectively. If there are outliers, the median and interquartile deviation (defined as $(Q_3 - Q_1)/2$) are the most appropriate measures of location and dispersion, respectively.
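
Several of the quantities reported by SAS follow directly from the definitions given above. The Python sketch below recomputes them for Example 2.1; it is an illustration only, not a reproduction of how SAS computes its output, and the quartile values are taken as given from the text ($Q_1 = 110$, $Q_3 = 130$) rather than recomputed.

```python
import math
import statistics

sbp = [121, 110, 114, 100, 160, 130, 130]

n = len(sbp)
mean = statistics.mean(sbp)
s = statistics.stdev(sbp)            # sample standard deviation (n - 1)

cv = (s / mean) * 100                # coefficient of variation, in percent
se = s / math.sqrt(n)                # standard error of the mean
data_range = max(sbp) - min(sbp)     # range

q1, q3 = 110, 130                    # quartiles as reported in the text
iqr = q3 - q1                        # interquartile range
iq_dev = iqr / 2                     # interquartile deviation

print(f"CV = {cv:.2f}%, SE = {se:.4f}")
print(f"range = {data_range}, IQR = {iqr}, interquartile deviation = {iq_dev}")
```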

The next part of the SAS output contains "Tests for Location"; these will be discussed in detail in Chapter 5. After "Tests for Location," SAS output displays the "Quantiles" (or percentiles) of the variable, where the kth quantile is defined as the score that holds k% of the data below it. For example, the maximum value is equivalent to the 100th quantile and equal to 160 in SAS Example 2.1, the 75th quantile is equivalent to the third quartile (130), and so on. SAS also presents the 99th quantile, which is equal to 160 in SAS Example 2.1. Since this is a small data set, these fine classifications are unnecessary and not meaningful. SAS then prints the "Extreme Observations" in the data set. In particular, the five smallest and five largest values are printed. Next to each value, in parentheses, is the observation number (i.e., the position of the observation in the data set); for example, the smallest value is 100, the fourth observation among the seven.

Example 2.2
Summary Statistics on Total Cholesterol Levels

Eight subjects are randomly selected from a population of patients with hypertension. Total serum cholesterol, in mg/100 ml, is measured on each subject and the sample data are

    212   197   184   211   233   245   219   260

In the following, we summarize the cholesterol data using the statistics introduced in Example 2.1. Since total serum cholesterol is a continuous variable (as was systolic blood pressure), the same statistics will be computed. We will limit our discussion here to items and concepts that were addressed in detail in Example 2.1. Again, a small sample size is used to illustrate the calculation of descriptive statistics to keep computation time to a minimum. In actual practice, the same techniques can be applied to larger samples.
32
Chapter 2 Summarizing Data
We will let the variable X denote total serum cholesterol. The table displays the data elements, which have been ordered from smallest to largest, along with the value of each data element squared (which is used in the computation of the sample variance):

          X          X^2
        184       33,856
        197       38,809
        211       44,521
        212       44,944
        219       47,961
        233       54,289
        245       60,025
        260       67,600
Total 1,761      392,005
The number of patients, or sample size, in this sample is n = 8. The mean cholesterol level is

$$\bar{X} = \frac{\sum X}{n} = \frac{1{,}761}{8} = 220.1$$

Notice that some of the total cholesterol levels are above the mean and others are below. This will always be true, as displayed in Figure 2.2 using Example 2.1. The sample mean represents a typical cholesterol level in this sample. To address dispersion in the sample, we use the computational formula for the sample variance (2.4):

$$s^2 = \frac{\sum X^2 - (\sum X)^2/n}{n-1} = \frac{392{,}005 - (1{,}761)^2/8}{7} = \frac{392{,}005 - (3{,}101{,}121)/8}{7} = \frac{392{,}005 - 387{,}640.125}{7} = \frac{4{,}364.875}{7} = 623.6$$

Generally, the variance is not used to summarize dispersion. Instead, the sample standard deviation is computed:

$$s = \sqrt{623.6} = 25.0$$
The sample standard deviation represents how far each total cholesterol level is from the mean of 220.1. Again, by itself the standard deviation is often difficult to interpret. In particular, it is generally difficult to quantify what value of a standard deviation is considered large and what value is considered small. Individuals with substantive knowledge of the characteristic under investigation might have a feel for what is large and small with respect to the standard deviation.

Our standard summary of the cholesterol levels (a continuous variable) is n = 8, $\bar{X} = 220.1$, and s = 25.0.

The maximum cholesterol level is 260 and the minimum is 184. The range is 260 - 184, or 76. There is a substantial difference between the smallest and largest cholesterol levels, a difference of 76 units. Recall that the range is another measure of dispersion. In general, the range is less useful as a measure of dispersion than the sample standard deviation.

The median is the middle value, or the value that holds 50% of the cholesterol levels above it and 50% of the cholesterol levels below it. It can be computed by arranging the data from smallest to largest, and successively counting from the left and the right to arrive at the median, or middle, value:

Step 1: Set aside the smallest and largest values (184 and 260), leaving 197, 211, 212, 219, 233, 245.
Step 2: Set aside 197 and 245, leaving 211, 212, 219, 233.
Step 3: Set aside 211 and 233, leaving the two middle values, 212 and 219.

Because the number of observations in this sample is even (n = 8), there are two middle values. When the sample size is even, the median is defined as the mean of the two middle values:

Median = (212 + 219)/2 = 215.5

In Example 2.2, 50% of the cholesterol levels are above 215.5 and 50% of the cholesterol levels are below 215.5.
We now have two statistics that represent a typical cholesterol level, the sample mean and the sample median. Although they convey different information, in general only one is necessary. Which is the best statistic to address location in this sample? Reviewing the cholesterol levels in the sample, there do not appear to be outliers at either extreme, thus the mean is a better measure of location. An individual with clinical expertise would, however, be in a better position to make that assessment. In a later example in this chapter, we will present guidelines for assessing outliers based on statistical formulations.

In this example, as in Example 2.1, the sample size is small and it is easy to order the data elements from smallest to largest and to count into the middle from right and left to determine the median. In applications where the sample size is larger, it may be more efficient to use the two-step method described in Example 2.1 to determine the median: we first compute the position(s) of the middle value(s) in the ordered data set and then locate those value(s). When the number of observations is even, there are two middle values, and their positions are computed as follows:

$$\frac{n}{2} \quad \text{and} \quad \frac{n}{2} + 1$$

For Example 2.2,

$$\frac{n}{2} = \frac{8}{2} = 4\text{th position}, \qquad \frac{n}{2} + 1 = \frac{8}{2} + 1 = 5\text{th position}$$

The median is the mean of the observations in the 4th and 5th positions in the ordered data set (i.e., [212 + 219]/2 = 215.5).
To further describe the sample, we now compute the quartiles. As noted in Example 2.1, the best way to determine the quartiles is to first compute the positions of the quartiles in the ordered data set and then to locate the values. When the number of observations in the sample is even, the positions of the quartiles are determined by the following formula, where the brackets denote the integer part of the enclosed value:

$$\left[\frac{n+2}{4}\right] \qquad (2.8)$$

For Example 2.2,

$$\left[\frac{n+2}{4}\right] = \left[\frac{8+2}{4}\right] = \left[\frac{10}{4}\right] = [2.5] = 2$$

The quartiles are in the second positions from the top and bottom of the ordered data set. The first quartile is $Q_1 = 197$ and the third quartile is $Q_3 = 245$.
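The hand calculations in Example 2.2 can be retraced with a short Python sketch. This is an illustration only, not part of the SAS examples in the text; the function name `summarize_even_sample` is hypothetical, and the quartile rule is coded only for an even sample size, as in formula (2.8).

```python
import math

# Total serum cholesterol (mg/100 ml) for the eight subjects in Example 2.2.
chol = [212, 197, 184, 211, 233, 245, 219, 260]


def summarize_even_sample(values):
    """Summary statistics for a sample with an even number of observations.

    Hypothetical helper: it mirrors the hand calculations of Example 2.2
    and does not cover an odd sample size.
    """
    x = sorted(values)
    n = len(x)
    if n % 2 != 0:
        raise ValueError("this sketch only covers an even sample size")

    total = sum(x)                      # sum of X
    total_sq = sum(v * v for v in x)    # sum of X squared
    mean = total / n
    # Computational formula for the sample variance (2.4).
    variance = (total_sq - total ** 2 / n) / (n - 1)
    sd = math.sqrt(variance)
    # Median: mean of the values in positions n/2 and n/2 + 1 (1-based).
    median = (x[n // 2 - 1] + x[n // 2]) / 2
    # Quartile position [(n + 2)/4], counted from the bottom and from the top.
    k = (n + 2) // 4
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "sd": sd,
        "median": median,
        "q1": x[k - 1],
        "q3": x[n - k],
        "range": x[-1] - x[0],
    }


summary = summarize_even_sample(chol)
for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

Rounded to one decimal place, the printed mean, variance, and standard deviation agree with the values 220.1, 623.6, and 25.0 obtained by hand.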
SAS Example 2.2 Summary Statistics on Total Cholesterol Levels Using SAS

The following descriptive statistics were generated using SAS. For Example 2.2, we produced the abbreviated summary statistics as opposed to the more extensive summary statistics illustrated in Example 2.1. The following descriptive statistics were produced by SAS Proc Means (see the following interpretation and Section 2.4 for more details) using the data in Example 2.2. A brief interpretation appears after the output.
SAS Output for Example 2.2

Summary Statistics
The MEANS Procedure
Analysis Variable : chol total serum cholesterol

N           Mean        Std Dev        Minimum        Maximum
8    220.1250000     24.9710547    184.0000000    260.0000000
Interpretation of SAS Output for Example 2.2

The SAS Means Procedure is used to generate descriptive statistics on a continuous variable. Many different statistics can be requested; the statistics shown are the default statistics. The sample size is 8 (notice that SAS uses uppercase "N" as opposed to lowercase "n" to denote sample size), the sample mean is 220.1, and the sample standard deviation is 25.0. The minimum and maximum cholesterol levels are 184 and 260. SAS displays summary statistics to many decimal places by default (seven in this output). Users should round appropriately to report summary statistics.

Example 2.3 Summary Statistics on Ages

A sample of 51 individuals is selected for participation in a study of cardiovascular risk factors. The following data represent the ages of enrolled individuals measured in years (continuous variable). Here, age is measured in the usual way with a person being recorded as 65, for example, until the day he/she turns 66. The data, in ascending order, are as follows:

60  62  63  64  64  65  65  65  65  65  65  66  66  66  66  66  67
67  67  68  68  68  70  70  70  71  71  72  72  73  73  73  73  73
75  75  75  75  76  76  77  77  77  77  77  79  82  83  85  85  87

The number of subjects, or sample size, is n = 51, a much larger sample size than in previous examples. Here it is not possible to interpret the age data simply by inspecting the values. Instead, we need summaries of location and dispersion.
The mean age in this sample is

$$\bar{X} = \frac{\sum X}{n} = \frac{3{,}637}{51} = 71.3$$

In order to assess dispersion in the sample, we will compute the sample standard deviation. As a first step, we compute the sample variance using the computational formula presented in Example 2.1:

$$s^2 = \frac{261{,}439 - (3{,}637)^2/51}{51 - 1} = 41.4$$

The sample standard deviation is

$$s = \sqrt{41.4} = 6.4$$

In general, participants' ages deviate from the mean of 71.3 by 6.4 years. Notice that the magnitude of the standard deviation of the ages is smaller than the standard deviations we computed on the systolic blood pressures in Example 2.1 ($s_{SBP} = 19.4$) and on the cholesterol levels in Example 2.2 ($s_{CHOL} = 25.0$). Standard deviations are interpreted relative to their scale of measurement. In Example 2.3, the participants are very homogeneous with respect to age. It is possible that the study objectives were focused on individuals 60 years of age or older. Most, if not all, studies have very explicit inclusion and exclusion criteria, which must be recognized in order to appropriately interpret summary statistics.
As noted earlier, in many publications and research reports, investigators do not present raw data (i.e., observations measured on each member of a sample); instead, they present summary statistics. Suppose that we did not have access to the actual ages of each participant here, and that instead we had only the summary statistics: n = 51, X = 71.3, and s = 6.4. The mean and standard deviation are used to understand where the data are located and how they are spread. The Empirical Rule can be used to learn more about a particular characteristic based on these commonly available statistics (discussed further in Chapter 4):
Empirical Rule

Approximately 68% of the observations fall between $\bar{X} - s$ and $\bar{X} + s$.
Approximately 95% of the observations fall between $\bar{X} - 2s$ and $\bar{X} + 2s$.
Approximately all of the observations fall between $\bar{X} - 3s$ and $\bar{X} + 3s$.

Using the data in Example 2.3, the Empirical Rule indicates that approximately 68% of the ages fall between 71.3 - 6.4 = 64.9 and 71.3 + 6.4 = 77.7, approximately 95% of the ages fall between 58.5 and 84.1, and almost all of the ages fall between 52.1 and 90.5. Because we have the actual observations here, we computed the percentages of 51 observations that actually fell into each range. The following table illustrates how closely the Empirical Rule approximates the distribution of ages in this sample:

Range        Empirical Rule:               Percent of
             Percent of Observations       Sample Data
64.9-77.7    Approximately 68%             78.4%
58.5-84.1    Approximately 95%             94.1%
52.1-90.5    Almost all                    100%
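The comparison between the Empirical Rule and the 51 observed ages can be reproduced with a brief Python sketch. This is an illustration and not part of the SAS analysis in the text; the helper name `share_within` is hypothetical, and the ages are entered in ascending order. The sketch uses the unrounded mean and standard deviation, which select the same observations as the rounded limits used in the text.

```python
import statistics

# The 51 ages from Example 2.3, in ascending order.
ages = [
    60, 62, 63, 64, 64, 65, 65, 65, 65, 65, 65, 66, 66, 66, 66, 66, 67,
    67, 67, 68, 68, 68, 70, 70, 70, 71, 71, 72, 72, 73, 73, 73, 73, 73,
    75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 79, 82, 83, 85, 85, 87,
]

mean = statistics.mean(ages)
sd = statistics.stdev(ages)  # sample standard deviation (divides by n - 1)


def share_within(values, center, spread, k):
    """Percent of values lying within k spreads of the center (hypothetical helper)."""
    low, high = center - k * spread, center + k * spread
    inside = sum(1 for v in values if low <= v <= high)
    return 100 * inside / len(values)


# Observed percentages within 1, 2, and 3 standard deviations of the mean.
observed = {k: share_within(ages, mean, sd, k) for k in (1, 2, 3)}

print(f"mean = {mean:.1f}, s = {sd:.1f}")
for k, expected in zip((1, 2, 3), ("about 68%", "about 95%", "almost all")):
    print(f"within {k}s: Empirical Rule {expected}, sample {observed[k]:.1f}%")
```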
The Empirical Rule suggested that approximately 68% of the ages would fall between 64.9 and 77.7. In Example 2.3, 78.4% of the ages actually fell between 64.9 and 77.7. Similarly, the Empirical Rule suggested that approximately 95% of the ages would fall between 58.5 and 84.1. In Example 2.3, 94.1% of the ages actually fell between 58.5 and 84.1. Finally, the Empirical Rule suggested that almost all of the ages would fall between 52.1 and 90.5, and in fact they do.

The computation of the mean and standard deviation is cumbersome with a sample of 51 observations. We now use the SAS Proc Univariate procedure to generate descriptive statistics for the data in Example 2.3.

SAS Example 2.3 Summary Statistics on Ages Using SAS

The following descriptive statistics were generated using SAS Proc Univariate and the data in Example 2.3. An interpretation of the relevant components appears after the output.
SAS Output for Example 2.3

Summary Statistics
The UNIVARIATE Procedure
Variable: age (age in years)

Moments
N                        51      Sum Weights                51
Mean             71.3137255      Sum Observations         3637
Std Deviation     6.4358067      Variance           41.4196078
Skewness         0.57609039      Kurtosis           -0.2678579
Uncorrected SS       261439      Corrected SS       2070.98039
Coeff Variation  9.02463958      Std Error Mean     0.90119319

Basic Statistical Measures
Location                  Variability
Mean     71.31373         Std Deviation          6.43581
Median   71.00000         Variance              41.41961
Mode     65.00000         Range                 27.00000
                          Interquartile Range   10.00000
Tests for Location: Mu0=0
Test           Statistic          p Value
Student's t    t    79.13256      Pr > |t|    ...
Sign           ...
Signed Rank    ...

[...]

The binomial formula and Table B.1 produce the probability of observing exactly x successes out of n. In this example we are interested in the probability of observing more than 8 successes. Before using the binomial formula (or Table B.1), we must restate our problem in a format compatible with the binomial formula (and Table B.1):

$$P(X > 8) = P(X = 9 \text{ or } X = 10) = P(X = 9) + P(X = 10)$$

We can now compute P(X = 9) and P(X = 10) using the binomial formula (3.6), applied twice.

$$P(X = 9) = \frac{10!}{9!\,(10-9)!}\, 0.50^{9} (1 - 0.50)^{10-9} = 10(0.0020)(0.5) = 0.0098$$

$$P(X = 10) = \frac{10!}{10!\,(10-10)!}\, 0.50^{10} (1 - 0.50)^{10-10} = 1(0.0010)(1) = 0.0010$$

Thus, P(X > 8) = 0.0098 + 0.0010 = 0.0108 (see also Table B.1).

In this application, there is a 1.1% chance that the antibiotic will be effective in more than 8 patients when given to 10. The possible outcomes of this experiment are the values in the sample space, S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Because the antibiotic is 50% effective for any given patient, we are more likely to observe 4, 5, or 6 successes out of 10 than we are to observe few (0, 1, or 2) or many (8, 9, 10) successes (see Table B.1).
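The two applications of the binomial formula can be checked with a few lines of Python. This is a sketch for illustration, not the book's SAS or Table B.1 lookup; the function name `binomial_pmf` is hypothetical.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.50
p9 = binomial_pmf(9, n, p)     # exactly 9 successes
p10 = binomial_pmf(10, n, p)   # exactly 10 successes
p_more_than_8 = p9 + p10       # P(X > 8) = P(X = 9) + P(X = 10)

print(f"P(X = 9)  = {p9:.4f}")
print(f"P(X = 10) = {p10:.4f}")
print(f"P(X > 8)  = {p_more_than_8:.4f}")
```

The unrounded sum is about 0.0107; the text's 0.0108 comes from adding the two probabilities after each has been rounded to four decimal places.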
Recall Example 3.4 from Section 3.4 in which we had a population consisting of four laboratory mice, two diseased and two nondiseased. In the example we selected two subjects at random. Each selection resulted in one of two possible outcomes: selection of a diseased subject or selection of a nondiseased subject. The probability that we selected a diseased subject was 0.50 under the sampling with replacement strategy. Note that sampling with replacement ensures constant probability of success (in this example, probability of selecting a diseased subject is 0.5 on each selection as long as we sample with replacement). Suppose we call the selection of a diseased subject a success. (In general, we call the outcome of interest a success; often, particularly in medical applications, the outcome of interest is not the healthy outcome, e.g., disease, mortality.) The random variable X under the sampling with replacement strategy illustrated in Example 3.4 is an example of a binomial random variable, whose probability distribution is given below. Recall we developed this probability distribution by enumerating every possible sample, assigning the value of X (the number of diseased subjects selected into each sample) and summarizing:

X    P(X)
0    4/16 = 0.25
1    8/16 = 0.50
2    4/16 = 0.25

This probability distribution is an example of a binomial probability distribution with n = 2 and p = 0.5. The binomial distribution formula and Table B.1 give the same probabilities; for example, with n = 2 and p = 0.5, the probability of selecting no diseased subjects (X = 0) is

$$P(X = 0) = \frac{2!}{0!\,(2-0)!}\, 0.50^{0} (1 - 0.50)^{2-0} = 1(1)(0.25) = 0.25$$

In the binomial distribution, it can be shown that the mean and variance of the random variable X (i.e., the number of successes out of n binomial trials) are computed as follows:

$$\mu = np, \qquad \sigma^2 = np(1-p) \qquad (3.7)$$

Example 3.5 involved five patients in the experiment, and the probability of success was 70% for each patient. The possible outcomes of the experiment are given in the sample space S = {0, 1, 2, 3, 4, 5}. Each time the experiment is run, exactly one of these outcomes (number of successes) is observed.
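Formula (3.7) can be checked against a direct calculation over the sample space. The Python sketch below is an illustration only (the helper `binomial_pmf` is a hypothetical name); it uses n = 5 and p = 0.7 from Example 3.5 and compares the shortcut formulas with the mean and variance computed from the full probability distribution.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 5, 0.7

# Shortcut formulas (3.7).
mean_formula = n * p
var_formula = n * p * (1 - p)

# The same quantities from the full distribution over S = {0, 1, ..., n}.
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean_enum = sum(x * q for x, q in enumerate(pmf))
var_enum = sum((x - mean_enum) ** 2 * q for x, q in enumerate(pmf))

print(f"mean: formula {mean_formula:.2f}, enumeration {mean_enum:.2f}")
print(f"variance: formula {var_formula:.2f}, enumeration {var_enum:.2f}")
print(f"standard deviation: {sqrt(var_formula):.2f}")
```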
The mean (or expected) number of patients (out of five) in whom the antibiotic is effective is $\mu = np = 5(0.7) = 3.5$. We could never observe 3.5 successes in any performance of the experiment; instead, we observe one of the values in the sample space (e.g., exactly 3 or 4 or 5 successes out of 5). The variance in the number of successes is $\sigma^2 = 5(0.7)(1 - 0.7) = 1.05$. The standard deviation is $\sigma = 1.02$. The mean number of successes represents the typical number of successes. In reviewing the probability distribution (shown in Example 3.5), we see that the most likely outcomes (those with the highest probabilities) of the experiment are 3 and 4 successes.

In Example 3.6, where the probability of success was 50% for each patient, the expected number of patients (out of 10) in whom the antibiotic is effective is $\mu = np = 10(0.5) = 5$. Again, any of the outcomes in the sample space can be observed on any performance of the experiment. We are most likely (or expect) to observe 5 successes out of 10 when p = 0.50.

3.5 The Normal Distribution

The normal distribution is our second probability model and is the most widely used probability distribution for continuous random variables. A characteristic (continuous variable) is said to follow a normal distribution if the distribution of values is bell- or mound-shaped (Figure 3.4).

Figure 3.4 The Normal Distribution (horizontal axis: X = normal random variable)

The horizontal axis displays the values of the continuous normal random variable X. The vertical axis is scaled to accommodate the height of the curve at each value of the random variable that reflects the probability (or relative frequency) of observing that value. The total area under the normal curve is 1.0, as it is a probability distribution. The normal distribution is one where values in the center of the distribution are more likely (have higher probabilities) than values at the extremes.

The mathematical formula for the normal probability distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

where x is a continuous random variable ($-\infty < x < \infty$)
$\mu$ = mean of the random variable X
$\sigma$ = standard deviation of the random variable X
1. The normal distribution is symmetric about the mean (i.e., $P(X > \mu) = P(X < \mu) = 0.5$, see Figure 3.5a). A characteristic that follows a normal distribution is one in which there are as many values above the mean as below.

2. The mean = the median = the mode. This attribute follows directly from 1. If the distribution is symmetric at the mean, then half (50%) of the values are above the mean and half (50%) are below the mean. This is the definition of the median. Notice in Figure 3.5a that the peak of the distribution is exactly in the center of the distribution (at the mean = median). The height of the curve indicates the probability (or relative frequency) of observations at each point. The peak indicates the most frequent value, which is, by definition, the mode.

3. The mean and variance, $\mu$ and $\sigma^2$, completely characterize the normal distribution. If we know that a particular characteristic follows a normal distribution and we know $\mu$ and $\sigma$, then we know everything about that distribution. The mean and variance are the only parameters [...]
Figure 3.5(a)-(c) Properties of the Normal Distribution

[...] p = 0.7 (see Example 3.5). The second example illustrates the use of SAS to generate random variables from a normal distribution with mean 108 and standard deviation 14 (see Example 3.8). The third example illustrates the computation of percentiles of the normal distribution using the data presented in Example 3.8.
For each example, we present three components:

1. the SAS program code
2. the computer output
3. a description of the relevant components of the computer output along with their interpretation
SAS Example 3.5 Generating Random Variables from the Binomial Distribution
Effectiveness of Antibiotic (Example 3.5)
Ranbin Function (Generates Random Variables from Binomial Distribution)

Suppose an antibiotic has been shown to be 70% effective against a common bacteria. If the antibiotic is given to five individuals with the bacteria, what is the probability that it will be effective on exactly three?

In this application, we will use SAS to generate random variables from the binomial distribution with n = 5 (number of trials) and p = 0.70 (probability of success on any trial). We will generate many such random variables, and in doing so estimate the theoretical probability distribution. The SAS program code follows.
Program Code

    options ps=64 ls=80;        /* Formats the output page to 64 lines in length, 80 columns in width. */
    data one;                   /* Beginning of Data Step (data set name one). */
      do i=1 to 5000;           /* Beginning of a DO loop which will be repeated 5000 times. */
                                /* The index i counts the iterations. */
        x=ranbin(21439,5,0.7);  /* Generates a random variable, called x, from the binomial */
                                /* distribution with n=5, p=0.7. The first argument to the */
                                /* Ranbin function is the seed. (1) */
        output;                 /* Writes the generated variable, x, to data set one. */
      end;                      /* End of DO loop. */
    run;                        /* End of Data Step. */
    proc means;                 /* Procedure call (Proc Means to generate summary statistics). */
      var x;                    /* Specification of variable x. */
    run;                        /* End of Procedure section. */
    proc freq;                  /* Procedure call (Proc Freq to generate frequency */
                                /* distribution = probability distribution). */
      tables x;                 /* Specification of variable x. */
    run;                        /* End of Procedure section. */

(1) The seed is simply a random number used as a starting point for the random number generator (i.e., ranbin). When the seed is changed, different values are generated.
Computer Output

The MEANS Procedure
Analysis Variable : x

   N         Mean      Std Dev    Minimum      Maximum
5000    3.4942000    1.0282888          0    5.0000000

The FREQ Procedure

                              Cumulative    Cumulative
x    Frequency    Percent      Frequency       Percent
0           15       0.30             15          0.30
1          135       2.70            150          3.00
2          686      13.72            836         16.72
3         1529      30.58           2365         47.30
4         1798      35.96           4163         83.26
5          837      16.74           5000        100.00
Interpretation

The first part of the output is from the Means procedure. Shown are the number of observations in the sample (N = 5000), the mean, standard deviation, minimum, and maximum. In this application, the summary statistics are computed on the random variable X, which reflects the numbers of patients in whom the antibiotic is effective. Recall for the binomial distribution, the mean is given by $\mu = np$. In Example 3.5, n = 5 and p = 0.7. The theoretical mean is $\mu = 5(0.7) = 3.5$. In this data set in which we generated 5,000 random variables from a binomial distribution with n = 5 and p = 0.7, the observed mean is 3.494.

The next section of output displays a frequency distribution table for the random variable X. Notice that the values of X generated by SAS are between 0 and 5. The observed frequency distribution (based on 5000 observations) closely approximates the true binomial probability distribution (see Example 3.5). For example, from Table B.1 in Appendix B, P(X = 0) = 0.0024, P(X = 1) = 0.0284, P(X = 2) = 0.1323, P(X = 3) = 0.3087, P(X = 4) = 0.3601, P(X = 5) = 0.1681. The observed distribution (shown in the SAS output) is 0.003, 0.027, 0.137, 0.306, 0.360, and 0.167.

Using the SAS output, we can address the original question: If the antibiotic is given to five individuals, what is the probability that it will be effective in exactly three? From the SAS output, P(X = 3) = 0.306.
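The same simulation idea can be sketched in Python for readers without access to SAS. This is an assumption-laden illustration, not a translation of the Ranbin function: Python's random number generator differs from the one in SAS, so the simulated frequencies will not match the output above even with the same seed value, and the helper `simulate_binomial` is a hypothetical name.

```python
import random
from collections import Counter
from math import comb


def simulate_binomial(seed, n, p, draws):
    """Generate `draws` binomial(n, p) values by counting successes in n trials."""
    rng = random.Random(seed)  # the seed fixes the starting point of the generator
    return [sum(rng.random() < p for _ in range(n)) for _ in range(draws)]


n, p, draws = 5, 0.7, 5000
sample = simulate_binomial(21439, n, p, draws)

counts = Counter(sample)
observed = {x: counts.get(x, 0) / draws for x in range(n + 1)}               # relative frequencies
exact = {x: comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)}  # binomial formula
sample_mean = sum(sample) / draws

print(f"observed mean = {sample_mean:.3f} (theoretical mean = {n * p:.1f})")
for x in range(n + 1):
    print(f"P(X = {x}): simulated {observed[x]:.3f}, exact {exact[x]:.4f}")
```

As in the SAS example, the simulated relative frequencies should land close to the exact binomial probabilities, and changing the seed changes the simulated values but not that conclusion.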
SAS Example 3.8 Generating Random Variables from the Normal Distribution
Systolic Blood Pressures (Example 3.8)
Rannor Function (Generates Random Variables from the Standard Normal Distribution)

Systolic blood pressures are assumed to follow a normal distribution, with a mean of 108 and a standard deviation of 14.

In this application, we will use SAS to generate random variables from the standard normal distribution (Z) and then transform them into normal random variables with a mean of 108 and a standard deviation of 14.
Program Code

    options ps=64 ls=80;     /* Formats the output page to 64 lines in length, 80 columns in width. */
    data one;                /* Beginning of Data Step (data set name one). */
      mu=108;                /* Create a variable called mu, assign the constant 108. */
      sigma=14;              /* Create a variable called sigma, assign the constant 14. */
      do i=1 to 10000;       /* Beginning of a DO loop which will be repeated 10,000 times. */
                             /* The index i counts the iterations. */
        z=rannor(13755);     /* Generates a random variable, called z, from the standard normal */
                             /* distribution. The only argument to the Rannor function is the seed. (1) */
        x=mu+(z*sigma);      /* Creates a new variable, x, which is a linear function of z */
                             /* (x = mu + z*sigma). */
        output;              /* Writes the generated variables, x and z, to data set one. */
      end;                   /* End of DO loop. */
    run;                     /* End of Data Step. */
    proc chart;              /* Procedure call (Proc Chart to generate relative frequency histogram). */
      vbar z/type=pct;       /* Specification of variable z; type=pct displays relative frequencies. */
      vbar x/type=pct;       /* Specification of variable x; type=pct displays relative frequencies. */
    run;                     /* End of Procedure section. */
    proc means;              /* Procedure call (Proc Means to generate summary statistics). */
      var z x;               /* Specification of variables z and x. */
    run;                     /* End of Procedure section. */

(1) The seed is simply a random number used as a starting point for the random number generator (i.e., rannor). When the seed is changed, different values are generated.
Computer Output

[Proc Chart output: relative frequency histogram of z (vertical axis: Percentage; horizontal axis: Z Midpoint)]

[Proc Chart output: relative frequency histogram of x (vertical axis: Percentage; horizontal axis: X Midpoint)]
The MEANS Procedure

Variable        N           Mean        Std Dev        Minimum        Maximum
z           10000    0.000430808      0.9914063     -4.4742924      3.6632627
x           10000    108.0060313     13.8796876     45.3599064    159.2856784
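The transformation x = mu + z*sigma used in the program can be sketched in Python as follows. This is an illustration, not a reproduction of the SAS run: Python's generator differs from Rannor, so the simulated values will not match the output above even though the same seed value is reused here for familiarity.

```python
import random
import statistics

mu, sigma = 108, 14
rng = random.Random(13755)  # seed: starting point for the random number generator

# Standard normal draws, then the linear transformation x = mu + z * sigma.
z = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
x = [mu + zi * sigma for zi in z]

mean_z, sd_z = statistics.mean(z), statistics.stdev(z)
mean_x, sd_x = statistics.mean(x), statistics.stdev(x)

print(f"z: mean {mean_z:.4f}, std dev {sd_z:.4f}  (theoretical 0 and 1)")
print(f"x: mean {mean_x:.3f}, std dev {sd_x:.3f}  (theoretical {mu} and {sigma})")
```

Because x is a linear function of z, the observed mean and standard deviation of x equal mu + sigma times the mean of z and sigma times the standard deviation of z, which is why the two histograms have the same shape.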
Interpretation

The first part of the output is a relative frequency histogram for the random variable z. The random variable z follows the standard normal distribution, with a mean of 0 and a standard deviation of 1. The range of the observed values of z is listed along the horizontal axis and is approximately -3.0 to 3.0. Although the figure produced by SAS Proc Chart is somewhat crude, it does resemble the normal distribution curve.

The second figure displays the relative frequency histogram for the random variable x. The random variable x follows a normal distribution, with a mean of 108 and a standard deviation of 14. The range of the observed values of x is listed along the horizontal axis and is approximately 66 to 160. Again, the figure produced by SAS Proc Chart is crude, but it does resemble the normal distribution curve. In fact, it is identical to the relative frequency distribution of z, only the values along the horizontal axis are different.

The next section of output is from the Means procedure. Shown are the number of observations in the sample (N = 10,000), the mean, standard deviation, minimum, and maximum. Summary statistics are computed on both random variables z and x. Notice that the observed mean of z is 0.00043; the theoretical mean of z is 0. The observed standard deviation of z is 0.99141; the theoretical standard deviation is 1. The observed mean of x is 108.006; the theoretical mean of x is 108. The observed standard deviation of x is 13.8797; the theoretical standard deviation is 14.

SAS Example 3.9 Computing Percentiles of the Normal Distribution
Systolic Blood Pressures (Example 3.9)
Probit Function (Returns Percentiles of the Standard Normal Distribution)

In Example 3.9 we computed the 5th and 90th percentiles of the systolic blood pressures illustrated in the previous example.

In this application, we will use SAS to determine the 5th and 90th percentiles of the systolic blood pressures, which are assumed to follow a normal distribution, with a mean of 108 and a standard deviation of 14.
Program Code

    options ps=64 ls=80;     /* Formats the output page to 64 lines in length, 80 columns in width. */
    data one;                /* Beginning of Data Step (data set name one). */
      mu=108;                /* Create a variable called mu, assign the constant 108. */
      sigma=14;              /* Create a variable called sigma, assign the constant 14. */
      z05=probit(0.05);      /* Creates a variable called Z05, which is assigned the 5th */
                             /* percentile of the standard normal distribution. */
      z90=probit(0.90);      /* Creates a variable called Z90, which is assigned the 90th */
                             /* percentile of the standard normal distribution. */
      x05=mu+(z05*sigma);    /* Creates a new variable, X05, which is a linear function of Z05 */
                             /* (x = mu + z*sigma). X05 is the 5th percentile of a normal */
                             /* distribution with a mean of 108 and a standard deviation of 14. */
      x90=mu+(z90*sigma);    /* Creates a new variable, X90, which is a linear function of Z90 */
                             /* (x = mu + z*sigma). X90 is the 90th percentile of a normal */
                             /* distribution with a mean of 108 and a standard deviation of 14. */
    run;                     /* End of Data Step. */
    proc print;              /* Procedure call (Proc Print). */
    run;                     /* End of Procedure section. */
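The same percentile calculation can be sketched in Python, where `statistics.NormalDist().inv_cdf` plays the role that the Probit function plays in the SAS program. This is an illustration under that assumption, not the book's output; the variable names simply mirror those in the SAS code.

```python
from statistics import NormalDist

mu, sigma = 108, 14

# Percentiles of the standard normal distribution (the role of probit in SAS).
z05 = NormalDist().inv_cdf(0.05)
z90 = NormalDist().inv_cdf(0.90)

# Transform to the systolic blood pressure scale: x = mu + z * sigma.
x05 = mu + z05 * sigma
x90 = mu + z90 * sigma

print(f"z05 = {z05:.3f}, x05 = {x05:.1f}")
print(f"z90 = {z90:.3f}, x90 = {x90:.1f}")
```

Calling `NormalDist(mu, sigma).inv_cdf(0.05)` directly gives the same 5th percentile without the intermediate z value; the two-step version is shown because it parallels the standardization used in the text.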
Suppose we increase our sample size to n = 25. Then

$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = \frac{10}{5} = 2$$

Notice that under each sampling strategy the mean of the sample means is equal to the population mean. However, when the sample size is increased from 4 to 25 the standard error (or variation in the sample means) is reduced from 5 to 2. When the samples are larger in size, there is less variation in the sample means; the sample means are more tightly clustered about the population mean.
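The drop in the standard error from 5 to 2 follows from dividing the population standard deviation by the square root of the sample size. The short Python sketch below illustrates this; the value sigma = 10 is an assumption inferred from the two standard errors quoted in the text, and `standard_error` is a hypothetical helper name.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)


sigma = 10  # assumed population standard deviation (inferred, not stated in this passage)
se_4 = standard_error(sigma, 4)
se_25 = standard_error(sigma, 25)

print(f"n = 4:  standard error = {se_4}")
print(f"n = 25: standard error = {se_25}")
```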
4.2 The Central Limit Theorem

The following mathematical theorem is perhaps the most important theorem in statistics.

The Central Limit Theorem

Suppose we have a population with mean $\mu$ and standard deviation $\sigma$. If we take simple random samples of size n with replacement from the population, for large n, the sampling distribution of the sample means is approximately normally distributed with

$$\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where, in general, $n \ge 30$ is sufficiently large.

The Central Limit Theorem (CLT) is important because in statistical inference we will make inferences about the population mean ($\mu$) based on the value of a single sample mean. The CLT states that for large samples ($n \ge 30$), the distribution of the sample means is approximately normal. If the population is normal, then the results hold for any size n. If the population is binomial, then the following criteria are required: $np \ge 5$ and $n(1 - p) \ge 5$. In Chapter 3, we computed probabilities about normal random variables by standardizing (transforming to Z) and using Table B.2. The CLT tells us that we can use that same process to compute probabilities about the sample mean even if the population is not normal. This will be useful in statistical inference because when we make inferences about population parameters based on sample statistics, we will attach probability statements that quantify the precision in our inferences.

To reinforce the results of the Central Limit Theorem, we now present three illustrations. In each illustration we display the population distribution along with the sampling distributions of the sample means based on samples of size 5, 15, 30, and 50. We display the distributions graphically and present parameters associated with each in tabular form. The illustrations are presented in Figures 4.1-4.3. Figures 4.1 and 4.2 depict nonnormal populations; Figure 4.3 illustrates the normal population distribution. Figures 4.1a-4.1d display the sampling distributions of the sample mean based on simple random samples (s.r.s.) of size n = 5, n = 15, n = 30, and n = 50, respectively, from the population displayed in Figure 4.1. Similar graphs are shown for sampling distributions from the populations displayed in Figures 4.2 and 4.3. Notice for the normal population distribution (Figure 4.3), the sampling distributions of the sample mean are normal for each sample size considered. In the nonnormal cases (Figures 4.1 and 4.2), the sampling distributions of the sample means approach normality as the sample size increases (e.g., $n \ge 30$).

The mean of the uniform population shown in Figure 4.1 is 100, with a standard deviation of 57.7. When simple random samples are drawn, the means of the sampling distributions for samples of each size are equal to 100. The standard deviations of the sample means, or the standard errors, decrease as the sample size increases (see Table 4.2 and Figures 4.1a-d). Notice how the shapes of the sampling distributions of the sample means start to look approximately normal for sample sizes of 30 or larger.
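The behavior described for the uniform population can be explored with a small simulation. The Python sketch below is an illustration under stated assumptions: it takes the population to be uniform on 0 to 200 (which has mean 100 and standard deviation of about 57.7, matching the values quoted above), and the seed, the number of replicates, and the helper name `simulate_sample_means` are arbitrary choices.

```python
import random
import statistics
from math import sqrt

# Assumed population: uniform on [0, 200], so the mean is 100 and the SD is 200/sqrt(12).
POP_LOW, POP_HIGH = 0.0, 200.0
pop_mean = (POP_LOW + POP_HIGH) / 2
pop_sd = (POP_HIGH - POP_LOW) / sqrt(12)


def simulate_sample_means(n, replicates, rng):
    """Draw `replicates` simple random samples of size n and return their means."""
    return [
        sum(rng.uniform(POP_LOW, POP_HIGH) for _ in range(n)) / n
        for _ in range(replicates)
    ]


rng = random.Random(2006)  # arbitrary seed so the run is repeatable
results = {}
for n in (5, 15, 30, 50):
    means = simulate_sample_means(n, 2000, rng)
    results[n] = {
        "mean_of_means": statistics.mean(means),
        "simulated_se": statistics.stdev(means),
        "theoretical_se": pop_sd / sqrt(n),
    }

print(f"population: mean {pop_mean:.0f}, SD {pop_sd:.1f}")
for n, r in results.items():
    print(
        f"n = {n:2d}: mean of sample means {r['mean_of_means']:.1f}, "
        f"simulated SE {r['simulated_se']:.1f}, sigma/sqrt(n) {r['theoretical_se']:.1f}"
    )
```

A histogram of the simulated means for each n would show the flat population shape giving way to a bell shape as n grows, which is the pattern the figures are meant to convey.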
Figure 4.1 Uniform Population

Table 4.2 Uniform Population

Distribution              Sample Size   Mean   SD
Population distribution   -             100    57.7
Sampling distribution     5             100    25.9
Sampling distribution     15            100    14.9
Sampling distribution     30            100    10.5
Sampling distribution     50            100    8.2

Figure 4.1a Sampling Distribution of $\bar{X}$, n = 5
Figure 4.1b Sampling Distribution of $\bar{X}$, n = 15
Figure 4.1c Sampling Distribution of $\bar{X}$, n = 30
Figure 4.1d Sampling Distribution of $\bar{X}$, n = 50

Figure 4.2 Skewed Population

[Table 4.3 Skewed Population: population distribution with mean 100 and SD 100; sampling distributions for n = 5, 15, 30, and 50, each with mean 100]
We now consider a second nonnormal population, one which is skewed to the right, with most observations clustered at the low end of the distribution (Figure 4.2).

The mean of the skewed population shown in Figure 4.2 is 100, with a standard deviation of 100. When simple random samples are drawn, the means of the sampling distributions for samples of each size are equal to 100. The standard deviations of the sample means, or the standard errors, decrease as the sample size increases (see Table 4.3 and Figures 4.2a-d). Again, notice how the shapes of the sampling distributions of the sample means start to look approximately normal for sample sizes of 30 or larger.

Figure 4.2a Sampling Distribution of $\bar{X}$, n = 5
Figure 4.2b Sampling Distribution of $\bar{X}$, n = 15
Figure 4.2c Sampling Distribution of $\bar{X}$, n = 30
Figure 4.2d Sampling Distribution of $\bar{X}$, n = 50

The last example involves a normal population (Figure 4.3). Notice that the sampling distributions of the sample means are approximately normally distributed for even small sample sizes (e.g., n = 5). In Figures 4.1 and 4.2, the sampling distributions of the sample mean looked normal only when the sample size was 30 or greater.

Figure 4.3 Normal Population

Table 4.4 Normal Population

Distribution              Sample Size   Mean   SD
Population distribution   -             100    10
Sampling distribution     5             100    4.5
Sampling distribution     15            100    2.6
Sampling distribution     30            100    1.8
Sampling distribution     50            100    1.4

Figure 4.3a Sampling Distribution of $\bar{X}$, n = 5
Figure 4.3b Sampling Distribution of $\bar{X}$, n = 15
Figure 4.3c Sampling Distribution of $\bar{X}$, n = 30
Figure 4.3d Sampling Distribution of $\bar{X}$, n = 50

Example 4.3
Applying the Central Limit Theorem: Non-Normal Population

Telephone calls placed to a drug hotline during the hours of 9-5 weekdays have a mean length of 5 minutes with a standard deviation of 5 minutes. The distribution of all calls (i.e., the population) is of the form:

[Figure: population distribution of X = duration of call, in minutes]

Suppose the population is very large, and we take simple random samples of three calls (n = 3). The following are true by (4.1):

$$\mu_{\bar{X}} = 5, \qquad \sigma_{\bar{X}} = \frac{5}{\sqrt{3}} = 2.89$$

Similarly, if we take s.r.s. of size n = 100, then by (4.1)

$$\mu_{\bar{X}} = 5, \qquad \sigma_{\bar{X}} = \frac{5}{\sqrt{100}} = 0.5$$

By the Central Limit Theorem the distribution of the sample means for samples of size n = 100 is approximately normal as shown here:

[Figure: sampling distribution of $\bar{X}$, n = 100]

The distribution of the lengths of phone calls (X) to the drug hotline for all callers (the population) was not normally distributed. The population distribution was skewed to the right, with the majority of calls lasting a shorter time and fewer calls lasting a longer time. When we look at collections of 100 calls at a time (samples of n = 100), the distribution of the mean length of calls ($\bar{X}$) is approximately normally distributed (as shown).

Most of the topics addressed in subsequent chapters deal with statistical inference based on a single sample from the population of interest. We summarize the sample and then make inferences about unknown population parameters (e.g., $\mu$) based on sample statistics (e.g., $\bar{X}$). Without knowing the population distribution, as long as the sample is sufficiently large (usually size 50 is sufficient), we can appeal to the Central Limit Theorem to make probabilistic statements about the relationship between the sample mean and the unknown population mean. For example, the Central Limit Theorem states that the sample mean is approximately normally distributed with mean and standard deviation $\mu_{\bar{X}}$ and $\sigma_{\bar{X}}$, respectively. We can make statements about the distribution of sample means of size n by transforming to the standard normal distribution using (4.2) and Table B.2 (see Section 3.6).

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (4.2)$$

The following example illustrates the use of the Central Limit Theorem and formula (4.2).
Example 4.4
Applying the Central Limit Theorem: Normal Population

Suppose we have a normal population with $\mu = 100$ and $\sigma = 10$. If we take a simple random sample of size 225, find the probability that the sample mean falls between 99 and 102. The problem of interest is stated as follows:

$$P(99 < \bar{X} < 102)$$
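One way to evaluate a probability of this kind is to standardize both endpoints by the standard error and take the difference of two standard normal areas. The following is a minimal sketch, not the book's own worked solution; the function name `prob_sample_mean_between` is hypothetical, and `statistics.NormalDist` stands in for the lookup in Table B.2.

```python
from math import sqrt
from statistics import NormalDist


def prob_sample_mean_between(lo, hi, mu, sigma, n):
    """P(lo < sample mean < hi), treating the sample mean as normal
    with mean mu and standard error sigma / sqrt(n)."""
    se = sigma / sqrt(n)
    std_normal = NormalDist()       # standard normal, mean 0 and SD 1
    z_lo = (lo - mu) / se           # Z score of the lower endpoint
    z_hi = (hi - mu) / se           # Z score of the upper endpoint
    return std_normal.cdf(z_hi) - std_normal.cdf(z_lo)


# Inputs from the example: mu = 100, sigma = 10, n = 225.
p = prob_sample_mean_between(99, 102, mu=100, sigma=10, n=225)
print(round(p, 4))
```

With a standard error of 10/15, the endpoints correspond to Z scores of -1.5 and 3.0.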
30),
uted with
$$Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \qquad (5.12)$$
where $\mu_0$ is the mean specified in $H_0$ (i.e., $\mu_0 = 130$). From Chapter 3, we know the properties of Z. If Z (5.12) is close to zero, which occurs when $\bar{X}$ is close to $\mu_0 = 130$, we suspect that $H_0$ is most likely true. When Z is large, which occurs when $\bar{X}$ is larger than $\mu_0 = 130$, we suspect that $H_1$ is most likely true. In hypothesis testing, we need to determine the point at which Z is "too large." That point is called the critical value of Z.

Here we know that $\sigma = 15$ and n = 108. Therefore, under the null hypothesis (i.e., when $\mu = 130$), the distribution of the sample means is approximately normal with mean $\mu_{\bar{X}} = \mu = 130$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 15/\sqrt{108} = 1.44$ (Figure 5.5). Under $H_0$: $\mu = 130$, it is possible to observe any value of $\bar{X}$ displayed in Figure 5.5. However, we know that it is unlikely (i.e., the probability is very small) that $\bar{X}$ will take on a value in the tails of the distribution. For example, it is unlikely to observe values of $\bar{X}$ exceeding 132.88 (which is 2 standard deviations above the mean: $\mu_{\bar{X}} + 2\sigma_{\bar{X}}$). Recall from Chapter 4, $P(\bar{X} > 132.88) = P(Z > 2) = 1 - 0.9772 = 0.0228$. Therefore, if we observe a sample mean that exceeds 132.88 and we reject the null hypothesis in favor of the alternative, the probability that we are making a mistake in rejecting $H_0$ is only 2.28%. However, if we reject the null hypothesis for values of $\bar{X}$ that exceed 131.44, the probability that we are making a mistake in rejecting $H_0$ is $P(\bar{X} > 131.44) = P(Z > 1) = 1 - 0.8413 = 0.1587$. We must decide what level of error (specifically, error of this type where we incorrectly reject the null hypothesis) we can tolerate in the analysis. A 15.87% probability of making this type of mistake may be too high.
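The two tail probabilities quoted above can be reproduced directly from the sampling distribution under the null hypothesis. This is a small sketch under the stated inputs ($\mu_0 = 130$, $\sigma = 15$, n = 108); the helper name `prob_false_rejection` is hypothetical, and `statistics.NormalDist` replaces the table lookup.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 130, 15, 108
se = sigma / sqrt(n)                 # standard error, about 1.44


def prob_false_rejection(cutoff):
    """P(sample mean > cutoff) when H0 (mu = mu0) is true."""
    return 1 - NormalDist(mu0, se).cdf(cutoff)


# Cutoffs placed 1 and 2 standard errors above the null mean.
for k in (1, 2):
    cutoff = mu0 + k * se
    print(k, round(cutoff, 2), round(prob_false_rejection(cutoff), 4))
```

Moving the cutoff farther into the tail lowers the chance of rejecting a true null hypothesis, which is the trade-off the level of significance formalizes.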
In hypothesis testing, investigators select a level of significance, denoted $\alpha$, which is defined as the probability of rejecting $H_0$ when $H_0$ is true. The level of significance is generally in the range of 0.01 to 0.10, though any value from 0 to 1.0 can be selected. Because the level of significance reflects the likelihood of drawing an erroneous conclusion (i.e., $\alpha$ = P(reject $H_0$ | $H_0$ true)), small levels of significance are purposely selected.

Once a level of significance is selected, a decision rule is formulated. The decision rule is a formal statement of the criteria used to draw a conclusion in the hypothesis test. For example, suppose we select a level of significance of 5% (i.e., $\alpha = 0.05$, we allow a 5% chance of rejecting $H_0$ when $H_0$ is true). The critical value and decision rule are displayed graphically in Figure 5.5.

Figure 5.5 Distribution of $\bar{X}$ under $H_0$: $\mu = 130$ ($\sigma_{\bar{X}} = 1.44$), $\alpha = 0.05$
Once the rejection region is determined, the decision rule is given by

Reject $H_0$ if $Z \ge 1.645$
Do not reject $H_0$ if $Z < 1.645$

A final statement is made concerning our findings relative to the research or alternative hypothesis. Such a statement is as follows: We have significant evidence, $\alpha = 0.05$, that the mean systolic blood pressure for males aged 50 in 2004 has increased from 130.

Table 5.3 summarizes the steps involved in the test of hypothesis procedure. The steps are displayed with reference to a test about the population mean $\mu$. The same steps will be used to test hypotheses concerning other parameters. Table 5.4 contains critical values of Z for lower-, upper-, and two-tailed tests. The general form of the hypotheses is presented along with the form of the decision rule for each type of test. Table 5.5 summarizes three different formulas for test statistics in tests concerning the population mean $\mu$ and the conditions under which each should be applied.
Table 5.3 Tests of Hypothesis Concerning $\mu$

Step                                       Example
1. Set up hypotheses.                      $H_0$: $\mu = \mu_0$
                                           $H_1$: $\mu > \mu_0$
   Select level of significance.           $\alpha = 0.05$
2. Select appropriate test statistic.      $Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Z^ H„ if Z
) specified
the alternative hypothesis and the location of the rejection region (see Figure 5.6). The other types of tests are called lower-tailed tests and two-tailed tests. In lower-tailed tests, the alternative hypothesis is of the form $H_1$: $\mu < \mu_0$; and in two-tailed tests, the alternative hypothesis is of the form $H_1$: $\mu \ne \mu_0$. In lower-tailed tests, the research hypothesis indicates a decrease in the mean; in two-tailed tests, the research hypothesis indicates a difference (either an increase or a decrease).
Notice (n
5. Conclusion. In the final step, we draw a conclusion by comparing the test statistic (computed in step 4) to the decision rule (displayed in step 3). We do not reject $H_0$ because -2.262 < -1.02 < 2.262. We do not have significant evidence, $\alpha = 0.05$, to show that the mean starting salary for females is significantly different from $29,500. Are the starting salaries the same?
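The test statistic behind this conclusion can be recomputed from the raw data. The sketch below uses the ten salaries (in $1000s) listed in the SAS program for Example 5.10 and the critical value 2.262 quoted in the conclusion; the function name `one_sample_t` is hypothetical, and the critical value is taken from the text rather than computed, since the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, stdev

salaries = [32, 27, 31, 27, 26, 26, 30, 22, 25, 36]   # starting salaries in $1000s
mu0 = 29.5                                            # mean under the null hypothesis
t_critical = 2.262                                    # two-sided, alpha = 0.05, df = 9 (from the text)


def one_sample_t(data, mu0):
    """t statistic for H0: mu = mu0, using the sample standard deviation."""
    standard_error = stdev(data) / sqrt(len(data))
    return (mean(data) - mu0) / standard_error


t = one_sample_t(salaries, mu0)
reject_h0 = abs(t) >= t_critical
print(round(t, 2), reject_h0)
```

Because the statistic falls between the two critical values, the null hypothesis is not rejected, in line with the conclusion above.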
SAS Example 5.10 Testing Mean Starting Salary Against Referent Using SAS

The following output was generated using SAS Proc Means with an option to conduct a test of hypothesis. A brief interpretation appears after the output.

SAS Output for Example 5.10

The MEANS Procedure

Variable   Label                    N    Mean         Std Dev     Std Error   t Value   Pr > |t|
salary     Annual Salary in $000s   10   28.2000000   4.0496913   1.2806248   22.02
teststat                            10   -1.3000000   4.0496913   1.2806248   -1.02

Interpretation of SAS Output for Example 5.10

SAS will conduct a one-sample test of hypothesis in its Means procedure. However, in the one-sample test SAS assumes that the test of interest is $H_0$: $\mu = 0$ versus $H_1$: $\mu \ne 0$. In this application (and in most others), we are not interested in testing if the mean of the analytic variable is zero. We want to test if the mean starting salary among females is 29.5 (in $1000s, which is equal to $29,500). In order to use SAS to test the desired hypotheses, we create a new variable, which we call TESTSTAT; it is simply our original analytic variable (i.e., starting salary) minus 29.5. Using the variable TESTSTAT, we can use SAS to test the desired hypotheses. In particular, if the mean of TESTSTAT is significantly different from zero, then we can conclude that the mean salary is significantly different from 29.5 (since TESTSTAT is simply equivalent to salary - 29.5). Conversely, if the mean of TESTSTAT is not significantly different from zero, then we can conclude that the mean salary is not significantly different from 29.5.
= 0.0194. Graphically, the p value is the probability of observing a value as extreme or more extreme than the observed test statistic; that is, P(t > 2.50 or t < -2.50).

[Figure: t distribution with p = 0.0194, an area of 0.0097 in each tail beyond t = -2.50 and t = 2.50]

If this test were conducted by hand, critical values would be selected corresponding to the preselected level of significance $\alpha$. The critical values for a two-sided test with $\alpha = 0.05$ are shown next. Because the test statistic t = 2.50 exceeds the critical value t = 2.064 (see figure), we would reject $H_0$ in favor of $H_1$ and conclude that the mean is significantly different from 30.

[Figure: t distribution with critical values t = -2.064 and t = 2.064]

The same conclusion is reached by comparing the p value to the level of significance using rule (5.14). Because p = 0.0194 < $\alpha$ = 0.05, we reject $H_0$. (Note: For the same test, if $\alpha = 0.01$, we would not reject $H_0$.)

Notice that when comparing the test statistic to a critical value, actual t statistics are compared, whereas when comparing the p value to the level of significance, areas in the tails of the distribution are compared. For the one-sample tests of means ($H_0$: $\mu = \mu_0$ versus $H_1$: $\mu > \mu_0$,
/j.
Hi:
n
30.
COMPUTING p VALUES BY HAND

Example 5.8 Revisited: p Value in Test for Mean Systolic Blood Pressure

In Example 5.8, we ran the following test at a 5% level of significance:

$H_0$: $\mu = 130$
$H_1$: $\mu > 130$

The decision rule we used was given by

Reject $H_0$ if $Z \ge 1.645$
Do not reject $H_0$ if $Z < 1.645$

We computed a test statistic of

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{135 - 130}{15/\sqrt{108}} = 3.46$$

and rejected $H_0$ because 3.46 > 1.645. The p value is the smallest level of significance where we still reject $H_0$. Using the table, we examine each level of significance smaller than 0.05 to determine if we could still reject $H_0$ at that level. For example, at $\alpha = 0.025$ we still reject $H_0$ because 3.46 > 1.960. At $\alpha = 0.01$ we also reject $H_0$ because 3.46 > 2.326, at $\alpha = 0.005$ we reject $H_0$ because 3.46 > 2.576, at $\alpha = 0.001$ we reject $H_0$ because 3.46 > 3.090, but at $\alpha = 0.0001$ we cannot reject $H_0$ because 3.46 < 3.719. Therefore, the smallest level of significance where we still reject $H_0$ is 0.001. The significance of this data, or the p value, is 0.001. If we run this analysis using SAS, SAS produces an exact p value. Our hand computations produce only an approximate value (in fact, the exact p value is between 0.0001 and 0.001). To reflect the idea that this is an approximate p value, sometimes the value is reported as p < 0.001.
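The hand procedure, stepping down through tabulated significance levels, can be mirrored in a few lines and compared with an exact upper-tail area. This is a sketch under the inputs of the blood pressure example ($\bar{X} = 135$, $\mu_0 = 130$, $\sigma = 15$, n = 108); the list of significance levels is the one walked through in the text, and `statistics.NormalDist` supplies the critical values instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Test statistic for the upper-tailed test H0: mu = 130 versus H1: mu > 130.
z = (135 - 130) / (15 / sqrt(108))

# Hand approach: the smallest tabulated alpha at which H0 is still rejected.
alphas = [0.05, 0.025, 0.01, 0.005, 0.001, 0.0001]
still_rejected = [a for a in alphas if z > std_normal.inv_cdf(1 - a)]
approximate_p = min(still_rejected)

# Exact approach: area above z under the standard normal curve.
exact_p = 1 - std_normal.cdf(z)

print(round(z, 2), approximate_p, exact_p)
```

The exact area lies between 0.0001 and 0.001, which is why the hand result is often reported as p < 0.001.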
5.2.3 Power and Sample Size Determination

There are two types of errors that can be committed in hypothesis testing, a Type I error (i.e., reject $H_0$ when $H_0$ is true), or a Type II error (i.e., do not reject $H_0$ when $H_0$ is false). In Section 5.2.1 we introduced $\alpha$ = P(Type I error) = P(Reject $H_0$ | $H_0$ true) and $\beta$ = P(Type II error) = P(Do not reject $H_0$ | $H_0$ false). In each test of hypothesis, we specify $\alpha$, purposely choosing small values (e.g., 0.01, 0.02, 0.05, or 0.10) so that the P(Type I error) is controlled. The probability of a Type II error, $\beta$, is difficult to control because it depends on several factors. In fact, one of the factors on which $\beta$ depends is $\alpha$: $\beta$ decreases as $\alpha$ increases. Therefore, one must weigh the choice of a lower $\beta$ = P(Type II error), which is desirable, against a higher level of significance, which is undesirable. In hypothesis testing, we are concerned with the power of a test, defined as $1 - \beta$. The power of a test is defined as its ability to "detect" or reject a false null hypothesis. Specifically,

Power = $1 - \beta$ = P(Reject $H_0$ | $H_0$ false)   (5.16)

As power increases, $\beta$ decreases, resulting in a better test. Power (and therefore $\beta$) is a complicated function of three components:

1. n = sample size
2. $\alpha$ = level of significance = P(Type I error)
3. ES = the effect size = the standardized difference in means specified under $H_0$ and $H_1$

The power of a particular test is higher (or better) with a larger sample size, a larger level of significance, and a larger effect size. In this section, we introduce the concept of statistical power as it applies to the one-sample test of hypothesis about $\mu$ and present a simple application. Suppose we are interested in the following test.

$H_0$: $\mu = 100$
$H_1$: $\mu > 100$, $\alpha = 0.05$
To conduct the test, we select a random sample of subjects from the population of interest and analyze summary statistics, in particular $\bar{X}$. Under the null hypothesis (i.e., if $\mu = 100$), the distribution of sample means is as follows.

[Figure: distribution of $\bar{X}$ under $H_0$]

Assume that $\sigma_{\bar{X}} = 6$. Suppose we want to determine the power of the test if the true mean is 110. The following displays the distributions of the sample mean under the null and alternative hypotheses.

[Figure: distributions of $\bar{X}$ under $H_0$ and under $H_1$]

We now add $\alpha$ = P(Type I error) = P(Reject $H_0$ | $H_0$ true), the corresponding critical value, $\beta$ = P(Type II error) = P(Do not reject $H_0$ | $H_0$ false), and power = $1 - \beta$ = P(Reject $H_0$ | $H_0$ false) to the figure, showing the distributions of the sample mean under $H_0$ ($\mu = 100$) and $H_1$
$$\text{Power} = P\left(Z > 1.96 - \frac{|80 - 85|}{9.5/\sqrt{20}}\right) = P(Z > 1.96 - 2.36) = P(Z > -0.40) = 1 - 0.3446 = 0.6554$$
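The power expression used in this calculation can be wrapped in a small function so that other sample sizes or differences in means are easy to try. This is a sketch, not the book's program: the function name `power_two_sided` is hypothetical, the inputs ($\mu_0 = 80$, $\mu_1 = 85$, $\sigma = 9.5$, n = 20) are read from the calculation above, and because it avoids table rounding its result differs slightly from the hand value.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Approximate power of a two-sided Z test:
    P(Z > z_{1 - alpha/2} - |mu0 - mu1| / (sigma / sqrt(n)))."""
    std_normal = NormalDist()
    z_critical = std_normal.inv_cdf(1 - alpha / 2)   # 1.96 when alpha = 0.05
    shift = abs(mu0 - mu1) / (sigma / sqrt(n))       # difference in standard-error units
    return 1 - std_normal.cdf(z_critical - shift)


# The absolute value makes the power identical for mu1 = 85 and mu1 = 75.
print(round(power_two_sided(80, 85, 9.5, 20), 4))
print(round(power_two_sided(80, 75, 9.5, 20), 4))
```

Raising n or widening the difference increases the shift term and therefore the power.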
There is a 65% probability that this test will detect a difference of 5 units in means with n = 20 at $\alpha = 0.05$. The power is the same if $\mu = 75$. What if the sample size is increased to n = 50?

$$\text{Power} = P\left(Z > Z_{1-\alpha/2} - \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right) = P\left(Z > 1.96 - \frac{|80 - 85|}{9.5/\sqrt{50}}\right) = P(Z > 1.96 - 3.72) = P(Z > -1.76) = 1 - 0.0392 = 0.9608$$

There is a 96% probability that this test will detect a difference of 5 units in means with n = 50 at $\alpha = 0.05$. Suppose we go back to n = 20 and consider a 3-point difference in means (in either direction) (i.e., $\mu_1 = 83$):

$$\text{Power} = P\left(Z > Z_{1-\alpha/2} - \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right) = P\left(Z > 1.96 - \frac{|80 - 83|}{9.5/\sqrt{20}}\right) = P(Z > 1.96 - 1.41) = P(Z > 0.55) = 1 - 0.7088 = 0.2912$$

There is only a 29% probability that this test will detect a difference of 3 units in means (in either direction) with n = 20 at $\alpha = 0.05$. This is a small difference in means and with a small sample size we are not very likely to detect it. A larger sample would be required to ensure a higher probability of detecting such a difference.

In the following we describe techniques for determining the sample size required to ensure a specified power. In many applications, the number of subjects that can be sampled depends on financial and/or time constraints. In other cases, the investigators can choose a sample large enough to ensure a certain level of power. As described in Section 5.1.3, techniques from experimental design can be employed to determine the number of subjects required to achieve a certain level of power prior to mounting the study.

The sample size required to ensure a specific level of power in a two-sided test
is

$$n = \left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2 \qquad (5.19)$$

where $Z_{1-\alpha/2}$ is the value from the standard normal distribution with lower-tail area equal to $1 - \alpha/2$, $Z_{1-\beta}$ is the value from the standard normal distribution with lower-tail area equal to $1 - \beta$, and ES is the effect size, defined as the standardized difference in means under the null and alternative hypotheses (5.20):

$$ES = \frac{|\mu_1 - \mu_0|}{\sigma} \qquad (5.20)$$

where $\mu_0$ is the mean specified in $H_0$, $\mu_1$ is the mean specified in $H_1$, and $\sigma$ is the population standard deviation of the characteristic under investigation.

The sample size required to ensure a specific level of power in a one-sided test is

$$n = \left(\frac{Z_{1-\alpha} + Z_{1-\beta}}{ES}\right)^2 \qquad (5.21)$$

where $Z_{1-\alpha}$ is the value from the standard normal distribution with lower-tail area equal to $1 - \alpha$, $Z_{1-\beta}$ is the value from the standard normal distribution with lower-tail area equal to $1 - \beta$, and ES is the effect size (5.20).

To implement formulas (5.19) and (5.21) to compute the sample size required to ensure a certain level of power to detect a specified difference in means, several inputs are required. First, we must specify the level of significance, $\alpha$. This is usually straightforward because $\alpha = 0.05$ is considered standard. Second, we must specify $\beta$ = P(Type II error). In many experimental design applications, $\beta$ is set to 0.20, which reflects 80% power. With $\beta = 0.20$, there is an 80% chance of rejecting a false null hypothesis relative to a specific effect size. (In some instances, $\beta$ is set to 0.10, which reflects 90% power.) The third input, the effect size, is the most difficult to specify. The effect size reflects the magnitude of a clinically important difference in means. The effect size is best determined by an expert in the substantive area under investigation. In order to compute the effect size, we also need to quantify the variation in the characteristic under investigation ($\sigma$). If no such value exists, the same options outlined in Section 5.1.3 can be used to determine a reasonable approximation. The following example illustrates the computations.
Example 5.12 Power in Test of Hypothesis for Mean

Suppose we wish to conduct the following two-sided test at a 5% level of significance:

$H_0$: $\mu = 100$
$H_1$: $\mu \ne 100$

Suppose that a difference of 5 units in the mean score is considered a clinically meaningful difference. If the true mean is less than 95 or greater than 105, we do not want to fail to reject the null hypothesis. How many subjects would be required to ensure that the probability of detecting a 5-unit difference is 80% (i.e., power = 0.80)? For this example, suppose we know that $\sigma = 9.5$.

Because we wish to conduct a two-sided test, formula (5.19) is appropriate. First, we compute the effect size (5.20) by substituting the mean specified in $H_0$ ($\mu_0 = 100$), the mean we wish to detect when $H_0$ is false (here we could use either $\mu_1 = 95$ or $\mu_1 = 105$, both produce the same result), and the standard deviation $\sigma = 9.5$:

$$ES = \frac{|105 - 100|}{9.5} = 0.526$$

We now substitute the effect size into formula (5.19):

$$n = \left(\frac{Z_{0.975} + Z_{0.80}}{0.526}\right)^2$$

We use the standard normal distribution table (Table B.2) to determine $Z_{0.975}$ and $Z_{0.80}$. By definition, $Z_{0.975}$ is the Z value that holds 0.975 below it (or 0.025 above it, in the upper tail) and $Z_{0.80}$ is the Z value that holds 0.80 below it in the standard normal distribution, shown in the next figure:

[Figure: standard normal distribution showing $Z_{0.80}$, with area $1 - \beta = 0.80$ below it, and $Z_{0.975}$]

Using Table B.2 and the techniques described in Chapter 3, we determine $Z_{0.975} = 1.96$ and $Z_{0.80} = 0.84$. We now substitute these values:

$$n = \left(\frac{1.96 + 0.84}{0.526}\right)^2 = (5.323)^2 = 28.33$$

A sample of size 29 (again, we always round up to the next integer) will ensure that power = 0.80 (or 80%, $\beta = 0.20$) to detect a 5-point difference in means. If the true mean is at least 5 units different from 100, there is an 80% chance that the test will lead to rejection of the null hypothesis $H_0$: $\mu = 100$.

How many subjects would be required to ensure a power of 80% to detect a difference of 3 units? This and other scenarios can be investigated by substituting into the formulas just shown or by using SAS.

SAS Example 5.12 Power in Test of Hypothesis for Mean Using SAS

The following output was generated using SAS to determine the sample size required to ensure a specified power (we considered scenarios with 80% and 90% power) and differences in means of 5 and 3 units. A brief interpretation appears after the output.
SAS Output for Example 5.12

Obs   alpha   beta   mu0   mu1   sigma   power   es        n_2   n_1
1     0.05    0.2    100   105   9.5     0.8     0.52632   29    23
2     0.05    0.1    100   105   9.5     0.9     0.52632   38    31
3     0.05    0.2    100   103   9.5     0.8     0.31579   79    62
4     0.05    0.1    100   103   9.5     0.9     0.31579   106   86

Interpretation of SAS Output for Example 5.12

There is no SAS procedure specifically designed to determine the number of subjects required to detect a specific effect in the mean of a population.
However, similar to the approach taken in SAS Example 5.7, SAS can be used to program appropriate formulas. Once the formulas are implemented, users can evaluate different scenarios easily. In the example shown here, four scenarios are considered (denoted Obs 1-4, respectively). Scenario 1 corresponds to the situation presented in Example 5.12. Five variables are input into the program: the level of significance (alpha), the power (power), the mean under the null hypothesis (mu0), the mean under the alternative hypothesis (mu1), and the standard deviation (sigma). Several variables are created in the program and the values of all variables are printed in the output. A description of the variables and an interpretation of results follows.

In scenario 1 (Obs = 1), the level of significance (alpha) is set at 0.05 (5%) and the power was specified at 0.80 (80%); the probability of Type II error, $\beta$, is computed to be 0.20 (20%); the mean under the null hypothesis (mu0) was specified as 100; the mean under the alternative hypothesis was specified as 105 (mu1); and the standard deviation (sigma) was specified at 9.5. The effect size (es) was computed by dividing the absolute value of the difference in means under the null and alternative hypothesis by the standard deviation. In scenario 1 (Obs = 1), the effect size is 0.52632. Twenty-nine subjects (n_2) are required to ensure that the probability of detecting a 5-unit difference in means is 80% (i.e., power = 0.80), with a two-sided level of significance of 5%. Twenty-three subjects (n_1) are required to ensure that the probability of detecting a 5-unit difference in means is 80%, with a one-sided level of significance of 5%.

In scenario 2 (Obs = 2), we increase the power to 0.90 (90%). Thirty-eight subjects (n_2) are required to ensure that the probability of detecting a 5-unit difference in means is 90%, with a two-sided level of significance of 5%. Thirty-one subjects (n_1) are required to ensure that the probability of detecting a 5-unit difference in means is 90%, with a one-sided level of significance of 5%.

In scenario 3 (Obs = 3), we input a power of 0.80 (80%) but decrease the mean under the alternative hypothesis (mu1) to 103, which decreases the effect size (es) to 0.31579. Seventy-nine subjects (n_2) are required to ensure that the probability of detecting a 3-unit difference in means is 80%, with a two-sided test and 5% level of significance. Sixty-two subjects (n_1) are required to ensure that the probability of detecting a 3-unit difference in means is 80%, with a one-sided test and 5% level of significance.

In scenario 4 (Obs = 4) we input a power of 0.90 (90%) and consider the mean under the alternative hypothesis (mu1) as 103. One hundred six subjects (n_2) are required to ensure that the probability of detecting a 3-unit difference in means is 90%, with a two-sided test and 5% level of significance. Eighty-six subjects (n_1) are required to ensure that the probability of detecting a 3-unit difference in means is 90%, with a one-sided test and 5% level of significance.

Notice that more subjects are required to ensure a higher power. More subjects are also required to detect a smaller effect size. These results would be weighed against practical constraints to determine the most appropriate sample size for the application.
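The four scenarios in the SAS output can also be reproduced outside SAS by coding formulas (5.19), (5.20), and (5.21) directly. The following is a minimal sketch rather than a translation of the book's program: the function names `effect_size` and `sample_size` are hypothetical, and `statistics.NormalDist().inv_cdf` is assumed to play the role of the SAS `probit` function.

```python
from math import ceil
from statistics import NormalDist


def effect_size(mu0, mu1, sigma):
    """Standardized difference in means, formula (5.20)."""
    return abs(mu1 - mu0) / sigma


def sample_size(mu0, mu1, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Subjects required for the requested power, formulas (5.19) and (5.21)."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = std_normal.inv_cdf(1 - tail)   # Z with lower-tail area 1 - alpha/2 or 1 - alpha
    z_beta = std_normal.inv_cdf(power)       # Z with lower-tail area 1 - beta
    n = ((z_alpha + z_beta) / effect_size(mu0, mu1, sigma)) ** 2
    return ceil(n)                           # always round up to the next integer


# The four scenarios: (power, mean under the alternative hypothesis).
for power, mu1 in [(0.80, 105), (0.90, 105), (0.80, 103), (0.90, 103)]:
    n_2 = sample_size(100, mu1, 9.5, power=power, two_sided=True)
    n_1 = sample_size(100, mu1, 9.5, power=power, two_sided=False)
    print(power, mu1, round(effect_size(100, mu1, 9.5), 5), n_2, n_1)
```

Keeping the effect size in its own function makes it easy to see that halving the detectable difference roughly quadruples the required sample size.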
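5.3 Key Formulas

Application: Confidence interval estimate for $\mu$
Notation/Formula: $\bar{X} \pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$
Description: See Table 5.2 for alternate formulas

Application: Find n to estimate $\mu$
Notation/Formula: $n = \left(\frac{Z_{1-\alpha/2}\,\sigma}{E}\right)^2$
Description: Sample size to ensure margin of error E with confidence level reflected in Z (find Z in Table 5.1)

Application: Test statistic for $H_0$: $\mu = \mu_0$
Notation/Formula: $Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$
Description: See Table 5.3 for hypothesis testing procedure, Table 5.4 for critical values of Z, and Table 5.5 for alternate formulas

Application: Find p value
Description: Reject $H_0$ if $p < \alpha$

Application: Find power for test of $H_0$: $\mu = \mu_0$
Notation/Formula: Power $= P\left(Z > Z_{1-\alpha/2} - \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right)$
Description: Power of two-sided test for mean

Application: Find n to test $H_0$: $\mu = \mu_0$
Notation/Formula: $n = \left(\frac{Z_{1-\alpha/2} + Z_{1-\beta}}{ES}\right)^2$, where $ES = \frac{|\mu_1 - \mu_0|}{\sigma}$
Description: Sample size for two-sided test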
2l6
;
Chapter
;
;
Procedures for
5 Statistical Inference:
\x
Program Code options ps=62 ls=80;
Formats the output page to 62 lines in
data in; input salary; teststat=salary-29
in length
and 80 columns
width
Beginning of Data Step. Inputs variable salary. .
Creates a
5
new
variable, called teststat, by subtracting 29.5
(//„)
from each salary.
label salary = 'Annual Salary in $000s' cards
Attaches a descriptive label to salary.
Beginning of
32 27
Raw Data
section.
actual observations
31 27 26 26 30 22 25 36 run;
proc means n mean std stderr
t
prt;
Procedure
call.
Proc Means generates summary
continuous variables. Certain statistics
(see
var salary teststat;
statistics for
are requested-
and p values for conducting the
f
tests of hypothesis
Interpretation)
(iii)
Specification of variables.
run;
SAS Example
statistics
End of procedure section.
5.12
Determine the
Number
of Subjects Required to Detect a Specific
Effect Size in a Test of Hypothesis
About
(Example 5.12)
fi
Sample Size Requirements
We
wish to conduct the following Ho:
How many
subjects
ix
=
100
90% power
is
5%
H,:
vs.
would be required
tecting a 5- (or a 3-) unit difference
scenarios with
test at the
n
level of significance:
^
100
to ensure that the probability of de-
80%
(i.e.,
and assume that a
power
=
=
0.80)? Also consider
9.5.
Program Code

/* Formats the output page to 62 lines in length and 80 columns in width. */
options ps=62 ls=80;
/* Beginning of Data Step. */
data in;
  /* Inputs 5 variables: alpha, power, mu0, mu1, and sigma. */
  input alpha power mu0 mu1 sigma;
  /* Determines the value from the standard normal distribution with lower-tail area 1-alpha/2. */
  z_alpha2=probit(1-alpha/2);
  /* Determines the value from the standard normal distribution with lower-tail area 1-alpha. */
  z_alpha1=probit(1-alpha);
  /* Computes beta. */
  beta=1-power;
  /* Determines the value from the standard normal distribution with lower-tail area 1-beta. */
  z_beta=probit(1-beta);
\c means n
alpha
=
(
Description
all
mean std nun max dm;
generates a 100(1
mean
conducts
0.05
./.;
confidence interval tor
/
(Y\>
=
adherence
among
to medication therapy (initiated within the is
measured as the percent of prescribed
month
(e.g.,
100% =
perfect adherence
half ot prescribed doses taken).
A random sample of 75 HIV-infected patients new to medication therapy agree to participate in the study. Each reports their medication regimen and the doses they took over the past month. The mean percent adherence is 78% with a standard deviation of 7.2%. Construct a 95% confidence interval estimate of the mean percent adherence for all HIV-infected patients new to medication therapy.
*\~ 21. The mean lung capacity for nonsmoking males aged 50 is 2 liters. An investigator wants to examine if the mean lung capacities are significantly lower among former smokers of similar backgrounds (i.e., males aged 50 who smoked in the past and are not currently smokers). A random sample of 60 former smokers is selected. Their lung capacities have a mean of 1.8 liters with a standard deviation of 0.2"7
22.
liter.
Among
Run
the appropriate test at a 5".. level of significance.
private universities in the United States, the
students to professors
is
Vs. 2 (i.e.,
mean
ratio of
35.2 students for each professor)
with a standard deviation of 8.8. a.
What
is
the probability that
private universities that the
random sample
in a
mean
of 50
student-to-professor ratio
exceeds sS? b.
random sample of 50 universities is selected and mean student-to-professor ratio is 38. Is there evidence that the reported mean ratio actually exceeds 35.2? Use u =0.05. Suppose
a
the observed
23. The recommended daily allowance (RDA) of iron for adult females under the age of 51 is 18 mg. We wish to test if females under age 51 are, on average, getting less than 18 mg. A random sample of 48 females between the ages of 18 and 50 is selected. The average iron intake is 16.4 mg with a standard deviation of 4.1 mg. Run the appropriate test at the 5% level of significance.

24. If a statistical test is performed and $H_0$ is rejected at $\alpha = 0.01$, will it also be rejected at $\alpha = 0.05$?

25. A journal article reported that the mean hospital stay following a particular surgical procedure in 2001 was 7.1 days. A researcher feels that the mean hospital stay in 2002 should be less due to initiatives aimed at reducing health care costs. A random sample of 40 patients undergoing the same surgical procedure in 2002 had a mean length of stay of 6.85 days with a standard deviation of 7.01 days. Run the appropriate statistical test at $\alpha = 0.05$.

26. A consumer group is investigating a producer of diet meals to examine if its prepackaged meals actually contain the advertised 6 ounces of protein in each package. Based on the following data, is there any evidence that the meals do not contain the advertised amount of protein? Run the appropriate test at a 5% level of significance.

5.1  4.9  6.0  5.1  5.7  5.5  4.9  6.1  6.0  5.8
5.2  4.8  4.7  4.2  4.9  5.5  5.6  5.8  6.0  6.1

27. An article reported that patients under care for HIV have CD4 tests every 3 months, on average. A concern at Boston Medical Center is that there is a longer lag between tests. To test the concern, a random sample of 15 patients currently under care for HIV is selected and the time between their two most recent CD4 tests is recorded. The mean time between tests is 3.9 months with a standard deviation of 0.4 month. Run the appropriate test at a 5% level of significance.
28. A study reports that the mean systolic blood pressure for patients with a history of cardiovascular disease is 125 with a standard deviation of 15. We wish to design a study to evaluate an experimental medication for reducing blood pressure. How many subjects would be required to detect a 10-unit reduction in systolic blood pressure with 80% power? Assume that a two-sided test will be run at a 5% significance level.

29. We wish to test the hypothesis that the mean weight for females who are 5'8" is 140 pounds. Assuming $\sigma = 15$, using a 5% level of significance and with $n = 36$, find the power of the test if $\mu = 150$. (Use a two-sided test of hypothesis.)

30. We wish to run the following test: $H_0: \mu = 100$ versus $H_1: \mu \neq 100$ at $\alpha = 0.05$. If $\sigma = 10$, how large a sample would be required so that $\beta = 0.04$ if $\mu = 110$?

31. In a normal population with $\sigma = 5$, we wish to test $H_0: \mu = 12$ versus $H_1: \mu \neq 12$ at $\alpha = 0.05$. With a sample of 64 subjects, what is the probability of rejecting $H_0$ if $\mu = 14$? If $\mu = 9$?
32. Results of an industry survey in the computer software field find that the mean number of sick days taken by employees is 9.4 per year with a standard deviation of 2.7 per year. A local computer software company feels its employees take significantly fewer sick days per year. A random sample of 15 employees is selected from the local company and attendance records are reviewed. The following data represent the numbers of sick days taken by these employees over the past year:

8  10  5  5  4  3  9  5  4  15  6  2  15

Run the appropriate test at a 5% level of significance.

33. An analysis is conducted to compare the mean GRE scores among seniors in a local university to the national average of 500. Use the SAS output shown to address the following questions.

Variable    N      Mean           Std Dev       Std Error    T         Prob>|T|
GRE         250    512.0463595    86.2844894    5.4571103    93.835    0.0001
TESTSTAT    250    12.0463595     86.2844894    5.4571103    2.207     0.0282

a. Is the mean GRE score among all the local seniors significantly different from the national average? Justify briefly (show all parts of the test).
b. Can we say that the mean GRE score among all the local seniors is significantly higher than the national average? Support your conclusion with data (show all parts of the test).

34. An academic medical center surveyed all of its patients in 2002 to assess their satisfaction with medical care. Satisfaction was measured on a scale of 0 to 100, with higher scores indicative of more satisfaction. The mean satisfaction score in 2002 was 84.5. Several quality-improvement initiatives were implemented in 2003 and the medical center is wondering whether the initiatives increased patient satisfaction. A random sample of 125 patients seeking medical care in 2003 was surveyed using the same satisfaction measure. Their mean satisfaction score was 89.2 with a standard deviation of 17.4. Is there evidence of a significant improvement in satisfaction? Run the appropriate test at the 5% level of significance.
SAS Problems

Use SAS to solve each of the following problems.

1. A study is conducted to assess the extent to which patients who had coronary artery bypass surgery were maintaining their prescribed exercise programs. The following data reflect the numbers of times patients reported exercising over the previous month (4 weeks). For purposes of this study, exercise was defined as moderate physical activity lasting at least 20 minutes in duration.

14  11  6  13  12  S3  14

Use SAS Proc Means to generate summary statistics on the numbers of times patients exercised over the previous month and a 95% confidence interval estimate for the mean number of times patients exercised following surgery.

2. We wish to design a study to estimate the mean of a population. We wish to consider several scenarios and to estimate the required sample size for each. Use SAS to determine the sample sizes required for each scenario. Consider margins of error of 5, 10, and 20; confidence levels of 90% and 95%; and standard deviations of 55 and 65 (for a total of 3 x 2 x 2 = 12 scenarios).

3. The following data were collected from a random sample of 10 asthmatic children enrolled in a research study and reflect the number of days each child missed school during the past 3 months:

12  6  14  3  2  7  4  10  8  6

Use SAS Proc Means to generate summary statistics and request a 95% confidence interval for the mean.

4. A consumer group is investigating a producer of diet meals to examine if its prepackaged meals actually contain the advertised 6 ounces of protein in each package. The group collected the following data:

5.1  4.9  6.0  5.1  5.7  5.5  4.9  6.1  6.0  5.8
5.2  4.8  4.7  4.2  4.9  5.5  5.6  5.8  6.0  6.1

Use SAS Proc Means to generate summary statistics on the ounces of protein contained in the packaged meals. In addition, run a test to determine if there is any evidence that the meals do not contain the advertised amount of protein. Run the appropriate test at a 5% level of significance.

5. We wish to design a study to test the following hypotheses: $H_0: \mu = 100$ versus $H_1: \mu \neq 100$. We wish to consider several scenarios and to estimate the required sample size for each. Use SAS to determine the sample sizes required for each scenario to ensure power = 80%. Consider means under the alternative hypothesis of 90, 95, and 120; levels of significance of 0.05 and 0.01; and standard deviations of 7 and 10 (for a total of 3 x 2 x 2 = 12 scenarios).

6. Results of an industry survey in the computer software field find that the mean number of sick days taken by employees is 9.4 per year with a standard deviation of 2.7 per year. A local computer software company feels its employees take significantly fewer sick days per year. A random sample of 15 employees is selected from the local company and attendance records are reviewed. The following data represent the numbers of sick days taken by these employees over the past year:

8  10  5  5  4  3  6  2  9  5  4  15  15

Use SAS Proc Means to generate summary statistics on the numbers of sick days taken by employees. In addition, run a test to determine if there is any evidence that the employees take fewer than 9.4 sick days per year. Run the appropriate test at a 5% level of significance. (Note: SAS performs a two-sided test; make the adjustment to draw your conclusion.)
[Figure: chapter roadmap. Descriptive Statistics (Ch. 2), Probability (Ch. 3), Sampling Distributions (Ch. 4), and Statistical Inference (Chapters 5-13), with a chart matching the type of outcome variable (continuous, dichotomous, discrete, time to event) and the type of grouping variable to the appropriate analysis.]
Statistical Inference: Procedures for $(\mu_1 - \mu_2)$

6.1 Statistical Inference Concerning $(\mu_1 - \mu_2)$
6.2 Power and Sample Size Determination
6.3 Key Formulas
6.4 Statistical Computing
6.5 Analysis of Framingham Heart Study Data
6.6 Problems
We now describe statistical inference procedures when there are two comparison groups and the outcome of interest is a continuous variable. In such cases, we compare means between groups. In Chapter 5 we described statistical inference procedures for a single sample (estimation of an unknown mean and tests of hypothesis about the mean of the population).

Two sample applications are extremely common. For example, recall Example 5.9 in which a diet was evaluated for its ability to reduce cholesterol levels. In the example, a single sample of 12 individuals followed the diet for 3 months. At the end of the 3-month observation period, cholesterol levels were measured and compared against a known (or historical) value. In that example, we did not find statistically significant evidence of a reduction in cholesterol attributable to the diet. We made the assumption that the mean cholesterol level for males aged 50 not following the diet was 241, and we observed a mean cholesterol level in our sample of 235. The assumption that the mean cholesterol level would be 241 in persons not following the diet may or may not have been a valid assumption.

We could have used different study designs that might have given a better assessment of the impact of the diet on cholesterol. For example, we could have used a concurrent comparison group (instead of a historical comparison). This type of study involves selecting a group of individuals appropriate for the study (e.g., males aged 50) and randomly assigning them to one of two groups. (Later we will describe in detail the procedures for assigning individuals at random to comparison groups.) One group of individuals follows the diet while the comparison group does not. At the end of 3 months, we compare the mean cholesterol levels between comparison groups. If the groups are similar (and this is related to random assignment), except that one group followed the diet and the other did not, then differences in cholesterol can be attributed to the diet. This is an example of what we call a two independent samples procedure, in which the two comparison groups are physically distinct (i.e., they are comprised of different individuals).

Another study design for the assessment of the effect of the diet on cholesterol involves the 12 subjects, but before starting the diet, we measure their initial (sometimes called baseline) cholesterol levels. After each individual follows the diet for 3 months, we then measure a final cholesterol level. The focus in this design is how much each individual changes over time. If individuals' cholesterol levels drop from where they started, we conclude that the diet is effective in reducing cholesterol. This is an example of what we call a two dependent samples procedure, in which each individual serves as his or her own control.

In this chapter, we will describe two independent and two dependent samples procedures in detail. The most appropriate design for a specific application depends on a variety of factors, including the treatment and outcome under investigation and characteristics of the study subjects. These details will be discussed in subsequent chapters.
The techniques described here are concerned with the difference between two means. The techniques for estimating the difference between two means as well as the techniques for testing if two means are significantly different (or if one is larger than the other) are identical in principle to the techniques described in Chapter 5, which were concerned with the mean of a single population ($\mu$). The assumptions necessary for valid applications of the techniques and formulas that follow are

1. random samples from the populations under consideration
2. large samples ($n_i > 30$, where $i = 1, 2$) or normal populations

In two independent samples procedures, the parameter of interest is the difference in population means: $(\mu_1 - \mu_2)$. Confidence intervals in two independent samples applications are concerned with estimating $(\mu_1 - \mu_2)$, the difference in means, as opposed to the value of either mean as was the case in the one-sample estimation problems. The same is true in tests of hypotheses in the two-sample case. Both the null and research or alternative hypothesis are concerned with the difference in means, for example, $H_0: \mu_1 - \mu_2 = 0$ (no difference in means) versus $H_1: \mu_1 - \mu_2 \neq 0$ (means are different). In two dependent samples procedures, the parameter of interest is the mean difference: $\mu_d$.
1
persons following
6.1 Procedures
a
a
comparison of the mean cholesterol
special diet as
compared
to persons taking
Concerning^ 2
Iwo independent populations and —
(2.67
2.67 2
r
H
r
i
M
+ 1)* 2
+ T
fl
0.62
l
9
Using the t distribution table (Table B.3), the two-sided t value with 22 degrees of freedom is $t = 2.074$. The decision rule is

Reject $H_0$ if $t > 2.074$ or if $t < -2.074$
Do not reject $H_0$ if $-2.074 < t < 2.074$

We have significant evidence, $\alpha = 0.05$, of a difference in the mean heart rates following 30 minutes of aerobic exercise for females aged 20-24 years as compared to females aged 25-30 years. For this example, $p = 0.02$.

We now illustrate the preliminary test for homogeneity of variances followed by a test of hypothesis concerning the means of two independent populations.
Example 6.6 Testing Difference in Mean Public Health Awareness Scores Between Males and Females

Random samples of 11 male high school students and 12 female high school students are selected within a particular school district for an investigation. Students' scores on a public health (PH) awareness test are recorded; the descriptive statistics follow. The test is scored on a scale of 0-1000, with higher scores indicative of more awareness.

Statistic                                    Males     Females
Sample size                                  11        12
Mean PH awareness score                      560.0     554.2
Standard deviation in PH awareness scores    133.1     129.4

Test if the male students score significantly higher than the female students on the public health awareness test within this school district using a 5% level of significance.

1. Set up hypotheses.

   $H_0: \mu_1 = \mu_2$
   $H_1: \mu_1 > \mu_2$, $\alpha = 0.05$

   where $\mu_1$ = mean score for males and $\mu_2$ = mean score for females

2. Select the appropriate test statistic.

   In order to determine if this application falls into Case 2 or Case 3, a preliminary test of the equality of population variances must be conducted.

   1. Set up hypotheses.

      $H_0: \sigma_1^2 = \sigma_2^2$
      $H_1: \sigma_1^2 \neq \sigma_2^2$, $\alpha = 0.05$

   2. Select the appropriate test statistic.

      $F = s_1^2 / s_2^2$

   3. Decision rule.

      $df_1 = n_1 - 1 = 11 - 1 = 10$ (numerator degrees of freedom)
      $df_2 = n_2 - 1 = 12 - 1 = 11$ (denominator degrees of freedom)
      $F_{.975}(10, 11) = 3.53$ and $F_{.975}(11, 10) = 3.72$

      Reject $H_0$ if $F < 1/3.72 = 0.269$ or if $F > 3.53$
      Do not reject $H_0$ if $0.269 < F < 3.53$
   4. Test statistic.

      $F = s_1^2 / s_2^2 = (133.1)^2 / (129.4)^2 = 1.06$

   5. Conclusion.

      Do not reject $H_0$ since $0.269 < 1.06 < 3.53$. We do not have significant evidence to show that $\sigma_1^2 \neq \sigma_2^2$. Therefore, for the purposes of this test of means, we assume that the population variances are equal (i.e., $\sigma_1^2 = \sigma_2^2$) and apply the test statistic given under Case 2:

      $t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, $df = n_1 + n_2 - 2 = 11 + 12 - 2 = 21$

3. Decision rule.

   Reject $H_0$ if $t > 1.721$
   Do not reject $H_0$ if $t < 1.721$

4. Test statistic.

   We first compute $S_p$:

   $S_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\dfrac{10(133.1)^2 + 11(129.4)^2}{11 + 12 - 2}}$
NOTE: The two sample t tests (Case 2 and Case 3) concerning $(\mu_1 - \mu_2)$ are generally robust (i.e., insensitive to violations in assumptions such as normality and/or equality of variances) when the sample sizes are equal (i.e., $n_1 = n_2$). However, the F test for homogeneity of variances is sensitive to violations in the normality assumption. For example, if the analytic variable is not normally distributed and the F test is applied, the actual level of significance, $\alpha$, may exceed the specified level.
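The hand calculations in Example 6.6 can be checked with a short script. The following is a minimal Python sketch (not part of the text's SAS workflow) that works from the summary statistics only. The function names are hypothetical, group 1 is taken to be males as in the example, and the critical values are copied from the text rather than computed.

```python
import math

# Summary statistics from Example 6.6 (group 1 = males, group 2 = females)
n1, mean1, sd1 = 11, 560.0, 133.1
n2, mean2, sd2 = 12, 554.2, 129.4


def variance_ratio(sd_a, sd_b):
    """F statistic for the preliminary test: ratio of the sample variances."""
    return sd_a ** 2 / sd_b ** 2


def pooled_sd(n_a, sd_a, n_b, sd_b):
    """Pooled standard deviation Sp for the equal-variances (Case 2) test."""
    numerator = (n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2
    return math.sqrt(numerator / (n_a + n_b - 2))


def pooled_t(n_a, mean_a, sd_a, n_b, mean_b, sd_b):
    """Case 2 test statistic: difference in means over its pooled standard error."""
    sp = pooled_sd(n_a, sd_a, n_b, sd_b)
    return (mean_a - mean_b) / (sp * math.sqrt(1 / n_a + 1 / n_b))


# Critical values as quoted in the text (assumed, not recomputed here)
F_LOWER, F_UPPER = 0.269, 3.53
T_CRIT = 1.721

f_stat = variance_ratio(sd1, sd2)
sp = pooled_sd(n1, sd1, n2, sd2)
t_stat = pooled_t(n1, mean1, sd1, n2, mean2, sd2)

equal_variances_plausible = F_LOWER < f_stat < F_UPPER
reject_h0 = t_stat > T_CRIT

print(f"F = {f_stat:.2f}, Sp = {sp:.1f}, t = {t_stat:.2f}")
print("Treat as Case 2 (equal variances):", equal_variances_plausible)
print("Reject H0 (males score higher):", reject_h0)
```

Under these assumptions the statistic is positive because males are group 1; software that orders the groups differently would report the same magnitude with the opposite sign.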
SAS Example 6.6 Testing Difference in Mean Public Health Awareness Scores Between Males and Females Using SAS

The following output was generated using SAS Proc Ttest, which conducts a two independent samples test of hypothesis. The same procedure automatically produces a preliminary test of the homogeneity of variances. A brief interpretation appears after the output.

SAS Output for Example 6.6

The TTEST Procedure

Statistics
                                 Lower CL             Upper CL    Lower CL
Variable   Class         N       Mean       Mean      Mean        Std Dev     Std Dev
test       female        12      471.93     554.17    636.41      91.692      129.44
test       male          11      470.57     560       649.43      93.011      133.12
test       Diff (1-2)            -119.7     -5.833    108.06      100.94      131.2

Statistics
                         Upper CL
Variable   Class         Std Dev     Std Err    Minimum    Maximum
test       female        219.77      37.365     260        750
test       male          233.61      40.136     370        770
test       Diff (1-2)    187.5       54.767

T-Tests
Variable   Method           Variances    DF      t Value    Pr > |t|
test       Pooled           Equal        21      -0.11      0.9162
test       Satterthwaite    Unequal      20.7    -0.11      0.9163

Equality of Variances
Variable   Method      Num DF    Den DF    F Value    Pr > F
test       Folded F    10        11        1.06       0.9215

Interpretation of SAS Output for Example 6.6

In the top section of the output, SAS provides summary statistics on the analytic variable (test score) for each comparison group (females and males) and then for the differences in means (females - males). The summary statistics include the sample sizes, the sample mean (Mean), and 95% confidence intervals (CI) for the population means of each group and for the difference in means (the limits of the CI for the mean are labeled "lower CL mean" and "upper CL mean"), standard deviations, and 95% confidence intervals for the population standard deviations of each group and for the differences (the limits of the CI for the standard deviation are labeled "lower CL Std Dev" and "Upper CL Std Dev"), standard errors ($s/\sqrt{n}$), minimums, and maximums.

In the next section of the output, SAS performs the test of hypothesis for equality of means. SAS actually carries out two different tests, one in which the population variances are assumed to be equal (we called this Case 2) and one in which the population variances are assumed to be unequal (we called this Case 3). SAS uses the formulas we summarized in Tables 6.3 and 6.5 for equal and unequal variances, respectively. The values of the test statistics appear under the column headed "t Value," and just before these SAS displays the degrees of freedom. Again, these are computed using the formulas from Tables 6.3 and 6.5. Finally, SAS provides two-sided p values (assuming that the alternative hypothesis is $H_1: \mu_1 \neq \mu_2$).

The user must decide which analysis (equal or unequal variances) is most appropriate. To aid in this decision, SAS provides a preliminary test of the homogeneity of variances (i.e., $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$). The results of the preliminary test appear at the bottom of the SAS Ttest output in the section titled "Equality of Variances." SAS provides an F statistic (computed by taking the ratio of the sample variances). It computes F by dividing the larger sample variance by the smaller, regardless of the group (1 or 2) designation. Therefore, the F statistic produced by SAS is always greater than or equal to 1. In this case, since the sample variance among males is larger: $F = (133.1)^2/(129.4)^2 = 1.06$. The degrees of freedom associated with F are given by $df_1 = n_1 - 1 = 11 - 1 = 10$ and $df_2 = n_2 - 1 = 12 - 1 = 11$. For the preliminary test, SAS produces the probability of observing a value of F more extreme than the value of the test statistic (denoted Pr > F); the following rule should be applied to draw a conclusion: Reject $H_0$ if the p value is less than $\alpha$. In this case, the p value is 0.9215, so we do not reject $H_0$, and this example is considered an example of Case 2.

Because the preliminary test suggested that this example is an example of Case 2 (i.e., equal variances), we look at the output for the equal variances case; the test statistic is $t = -0.11$ with 21 degrees of freedom ($n_1 + n_2 - 2$). For the main test, SAS produces a two-sided p value. In this example, the p value is 0.9162. Because we are interested in a one-sided test, the following rule should be applied: Reject $H_0$ if (p value/2) < $\alpha$. In this example, we do not reject $H_0$ since (p value/2) = (0.9162/2) = 0.4581 > 0.05. We do not have significant evidence to show that male students score higher than female students in the public health awareness test within this school district. (Note: The test statistic produced in SAS differs from that computed in Example 6.6 due to the fact that SAS orders the groups alphabetically and calls females group 1.)
Example 6.7 Estimating Difference in Mean Number of Emergency Room Visits Between Children 5 and Under and 6-10 Years of Age

Suppose we wish to estimate the difference in the mean numbers of emergency room (ER) visits in 12 months among children with asthma age 5 and under as compared to children aged 6-10. For the purposes of this investigation, our analyses are restricted to children who are free from any other chronic conditions (i.e., they suffer from asthma alone). The following data are collected on random samples of 65 children age 5 and under and 50 children aged 6-10:
Mean Number
1.70.
We have significant evidence to show that $\sigma_1^2 \neq \sigma_2^2$. Therefore, for the purposes of this test of means, we apply the test statistic given under Case 3:

$Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

3. Decision rule.

   Using Table B.2B, the decision rule is

   Reject $H_0$ if $Z > 1.960$ or if $Z < -1.960$
   Do not reject $H_0$ if $-1.960 < Z < 1.960$

4. Test statistic.

   $Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{3.4 - 4.5}{\sqrt{\dfrac{4.5}{50} + \dfrac{1.9}{60}}} = \dfrac{-1.10}{0.349} = -3.15$

5. Conclusion.

   Reject $H_0$ since $-3.15 < -1.960$. We have significant evidence, $\alpha = 0.05$, to show that there is a difference in the mean number of visits to the student health center between university freshmen and sophomores. For this test, $p = 0.001$ (see Table B.2B).
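The Case 3 arithmetic can be reproduced in a few lines. This is a minimal Python sketch, assuming the summary values used in the calculation (means 3.4 and 4.5, sample variances 4.5 and 1.9, sample sizes 50 and 60); the function name `two_sample_z` is hypothetical. The two-sided p value is computed from the standard normal distribution, so it may differ slightly from a value read off a printed table.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, var1, n1, mean2, var2, n2):
    """Unequal-variances (Case 3) large-sample Z statistic from summary data.

    Returns the statistic, its standard error, and a two-sided p value
    based on the standard normal distribution.
    """
    se = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2) / se
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, se, p_two_sided


# Freshmen are group 1 and sophomores are group 2, as in the example
z, se, p = two_sample_z(3.4, 4.5, 50, 4.5, 1.9, 60)

Z_CRIT = 1.960  # two-sided critical value at alpha = 0.05, as quoted in the text
reject_h0 = abs(z) > Z_CRIT

print(f"standard error = {se:.3f}, Z = {z:.2f}")
print("Reject H0:", reject_h0)
```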
SAS Example 6.8 Testing Difference in Mean Number of Visits to Health Center Between Freshmen and Sophomores Using SAS

The following output was generated using SAS Proc Ttest. A brief interpretation appears after the output.

SAS Output for Example 6.8

The TTEST Procedure

Statistics
                                 Lower CL              Upper CL    Lower CL
Variable   Class         N       Mean       Mean       Mean        Std Dev     Std Dev
visits     freshman      50      2.8046     3.4071     4.0096      1.7709      2.12
visits     sophomore     60      4.1457     4.5059     4.8661      1.1819      1.3944
visits     Diff (1-2)            -1.767     -1.099     -0.43       1.5543      1.761

Statistics
                         Upper CL
Variable   Class         Std Dev     Std Err    Minimum    Maximum
visits     freshman      2.6418      0.2998     -1.548     7.7675
visits     sophomore     1.7007      0.18       1.4388     8.0719
visits     Diff (1-2)    2.0318      0.3372

T-Tests
Variable   Method           Variances    DF      t Value    Pr > |t|
visits     Pooled           Equal        108     -3.26      0.0015
visits     Satterthwaite    Unequal      81.9    -3.14      0.0023

Equality of Variances
Variable   Method      Num DF    Den DF    F Value    Pr > F
visits     Folded F    49        59        2.31       0.0022

Interpretation of SAS Output for Example 6.8

In the top section of the output, SAS provides summary statistics on the analytic variable (Number of Visits) for each comparison group (freshmen and sophomores) and then for the differences in means (freshmen - sophomores). The summary statistics include the sample sizes, the sample mean (Mean), and 95% confidence intervals (CI) for the population means of each group and for the difference in means (the limits of the CI for the mean are labeled "lower CL mean" and "upper CL mean"), standard deviations, and 95% confidence intervals for the population standard deviations of each group and for the differences (the limits of the CI for the standard deviation are labeled "lower CL Std Dev" and "Upper CL Std Dev"), standard errors ($s/\sqrt{n}$), minimums, and maximums.

In the next section of the output, SAS performs the test of hypothesis for equality of means. SAS actually carries out two different tests, one in which the population variances are assumed to be equal (we called this Case 2) and one in which the population variances are assumed to be unequal (we called this Case 3). SAS uses the formulas we summarized in Tables 6.3 and 6.5 for equal and unequal variances, respectively. The values of the test statistics appear under the column headed "t Value," and just before these SAS displays the degrees of freedom. Again, these are computed using the formulas from Tables 6.3 and 6.5. Finally, SAS provides two-sided p values (assuming that the alternative hypothesis is $H_1: \mu_1 \neq \mu_2$).

The user must decide which situation (equal or unequal variances) is most appropriate. To aid in this decision, SAS provides a preliminary test of the homogeneity of variances (i.e., $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$). The results of the preliminary test appear at the bottom of the SAS Ttest output in the section titled "Equality of Variances." SAS provides an F statistic (computed by taking the ratio of the sample variances). It computes F by dividing the larger sample variance by the smaller, regardless of the group (1 or 2) designation. Therefore, the F statistic produced by SAS is always greater than or equal to 1. In this case, since the sample variance among freshmen is larger: $F = (2.12)^2/(1.39)^2 = 2.31$. The degrees of freedom associated with F are given by: $df_1 = n_1 - 1 = 50 - 1 = 49$ and $df_2 = n_2 - 1 = 60 - 1 = 59$. For the preliminary test, SAS produces the probability of observing a value of F more extreme than the value of the test statistic (denoted Pr > F); the following rule should be applied to draw a conclusion: Reject $H_0$ if the p value is less than $\alpha$.
x^
=
sj
Z < -1.96
if
Z
h
=
62.72
6\5
Thus, $n_1 = n_2 = 63$ subjects (126 total) are needed.
SAS Example 6.14 Sample Size Determination in Tests for Differences in Means Using SAS

The following output was generated using SAS to determine the sample sizes (per group) required to ensure a specified power (we considered scenarios with 80% and 90% power) for differences in means of 0.4 and 0.3 units with a standard deviation of 0.3 units. A brief interpretation appears after the output.

SAS Output for Example 6.14

OBS    ALPHA    BETA    Z_ALPHA2    Z_BETA     POWER    MU1    MU2    SIGMA    ES         N_2
1      0.05     0.1     1.95996     1.28155    0.9      0.8    0.4    0.3      1.33333    12
2      0.05     0.1     1.95996     1.28155    0.9      0.8    0.5    0.3      1.00000    22
3      0.05     0.2     1.95996     0.84162    0.8      0.8    0.4    0.3      1.33333    9
4      0.05     0.2     1.95996     0.84162    0.8      0.8    0.5    0.3      1.00000    16

Interpretation of SAS Output for Example 6.14

There is no SAS procedure specifically designed to determine the number of subjects required to detect a specific difference in the means of two independent populations. But as we did in Chapter 5 to determine the sample size required to detect a specific effect size in the one-sample test of hypothesis, we use SAS to program appropriate formulas. Once the formulas are implemented, users can evaluate different scenarios easily. In the output shown, four scenarios are considered (denoted OBS 1-4). Scenario 1 corresponds to the situation presented in Example 6.14. Five variables are input into the program, the level of significance (alpha), the power (power), the mean for group 1 (mu1), the mean for group 2 (mu2), and the standard deviation (sigma). Several variables are created in the program and the values of all variables are printed in the output. A description of the variables and an interpretation of results follows.

In scenario 1 (OBS = 1), the level of significance (alpha) is set at 0.05 (5%); the probability of Type II error, $\beta$, is computed to be 0.10 (10%); the mean in group 1 (mu1) was specified as 0.8; the mean in group 2 was specified as 0.4 (mu2); the standard deviation (sigma) was specified at 0.3; and the power (power) was specified at 0.90 (90%). The effect size (es) was computed by dividing the absolute value of the difference in means between groups by the standard deviation. In scenario 1 (OBS = 1), the effect size is 1.33. Twelve subjects (n_2) are required per group to ensure that the probability of detecting a 0.4 unit difference in means is 90% (i.e., power = 0.90), with a two-sided level of significance of 5%.

In scenario 2 (OBS = 2), we change the mean in group 2 to 0.5. The effect size is reduced to 1.00, and a total of 22 subjects are required per group to ensure that the probability of detecting a 0.3 unit difference in means is 90% (i.e., power = 0.90), with a two-sided level of significance of 5%.

In scenarios 3 and 4 (OBS = 3 and 4), we consider the same scenarios and reduce the power to 0.80 (80%). The result is that fewer subjects are required. Nine and sixteen subjects are required per group, respectively, to ensure that the probability of detecting a 0.4 and 0.3 unit difference in means is 80%, with a two-sided level of significance of 5%.
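The four scenarios can also be evaluated outside SAS. The sketch below is a Python illustration, not the program that produced the output above. It assumes the per-group formula $n = 2\left[(Z_{1-\alpha/2} + Z_{1-\beta})/ES\right]^2$ with $ES = |\mu_1 - \mu_2|/\sigma$, rounded up to the next whole subject, which is consistent with the values printed in the output; the function name `n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, power, mu1, mu2, sigma):
    """Subjects needed per group for a two-sided, two independent samples test.

    Assumed formula: n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / ES)^2, rounded up,
    where ES = |mu1 - mu2| / sigma and beta = 1 - power.
    """
    z_alpha2 = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    effect_size = abs(mu1 - mu2) / sigma
    return math.ceil(2 * ((z_alpha2 + z_beta) / effect_size) ** 2)


# The four scenarios shown in the output: (alpha, power, mu1, mu2, sigma)
scenarios = [
    (0.05, 0.90, 0.8, 0.4, 0.3),
    (0.05, 0.90, 0.8, 0.5, 0.3),
    (0.05, 0.80, 0.8, 0.4, 0.3),
    (0.05, 0.80, 0.8, 0.5, 0.3),
]

sizes = [n_per_group(*scenario) for scenario in scenarios]
for obs, (scenario, n) in enumerate(zip(scenarios, sizes), start=1):
    print(f"OBS {obs}: power = {scenario[1]:.2f}, mu2 = {scenario[3]}, n per group = {n}")
```

Writing the calculation as a function makes it easy to add scenarios (for example, a different standard deviation) without repeating the arithmetic.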
6.3 Key Formulas
Application
Notation/Formula
Confidence interval estimate for
(/
))
>
—
po(l
—
ri .
j
p(\
—
p)
\-
is
distri-
given here.
Estimating Proportion of Patients with Osteoarthritis

Consider the data from Example 7.1 and compute a 95% confidence interval for the proportion of all patients in the physician's practice with diagnosed osteoarthritis.

The appropriate formula is given in Table 7.1:

$\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$

Substituting the sample data and the appropriate value from Table B.2A for 95% confidence:

$0.19 \pm 1.96 \sqrt{\dfrac{0.19(1 - 0.19)}{200}}$

$0.19 \pm 1.96(0.028)$

$0.19 \pm 0.0549$

$(0.135, 0.245)$

Thus, we are 95% confident that the true proportion of patients in this physician's practice with diagnosed osteoarthritis is between 13.5% and 24.5%.
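The interval can be checked numerically. The following minimal Python sketch assumes the sample proportion 0.19 and sample size 200 from the example; the function name `proportion_ci` is hypothetical. Because the text rounds the standard error to 0.028 before multiplying, the limits computed here at full precision can differ from (0.135, 0.245) in the third decimal place.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


lower, upper = proportion_ci(0.19, 200)
print(f"95% CI for p: ({lower:.3f}, {upper:.3f})")
```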
SAS Example 7.2 Estimating Proportion of Patients with Osteoarthritis Using SAS

The following output was generated using SAS Proc Freq, which generates a frequency distribution table for a categorical (or ordinal) variable. In this example, we record whether each subject has been diagnosed with osteoarthritis (or not). The usual convention is to assign scores of 1 to successes (i.e., diagnosis of osteoarthritis) and scores of 0 to failures (i.e., free of osteoarthritis). The input data consists of designations (0 or 1) for each subject. A brief interpretation appears after the output.

SAS Output for Example 7.2

The FREQ Procedure
x
Frequency
Cumulative Frequency
Percent 81.00 19.00
162 38
Proportion ASE 95% Lower Conf Limit 95% Upper Conf Limit
ASE under HO
One-sided Pr Two-sided Pr
0.0354
given in Table 7.1:

$Z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$

3. Decision rule (see Table B.2B in the Appendix for the appropriate critical value).
Reject
Do
H
if
Z < -1.960
not reject
H
if
or
if
-1.960
Z
—1.645.
We do not have significant evidence to show a reduction in the proportion of patients seen in the clinic for flu after receiving the vaccine.
Example 7.4 brings up an important issue of clinical versus statistical significance. In the formal test of hypothesis, we failed to reach statistical significance. However, we may have committed a Type II error (e.g., a larger sample size may be required to detect an effect). In any statistical application it is extremely important to look at the direction and magnitude of the observed effect. In Example 7.4, there is a reduction in the proportion of flu cases seen following the vaccines. The point estimate is 0.12, or 12%. Our test did not indicate that this was statistically significantly lower than 15%; however, there is a reduction and it should be evaluated carefully. Is this reduction clinically important? On a different note, was the study design we used optimal to address the question of effectiveness of the flu shots in pediatric patients? A concurrent comparison group might have provided a better comparison than historical data (i.e., $p_0 = 0.15$). We will discuss tests with a concurrent comparison group in the following sections.
7.2 Cross-Tabulation Tables

In applications involving discrete variables, cross-tabulation tables are often constructed to display the data. Cross-tabulation tables are also called R x C ("R by C") tables, where R denotes the number of rows in the table and C denotes the number of columns. A 2 x 2 table is illustrated in Example 7.5.

Example 7.5 Cross-Tabulation to Summarize Proportions in Two Populations

A longitudinal study is conducted to evaluate the long-term complications in diabetic patients treated under two competing treatment regimens. Complications are measured by incidence of foot disease, eye disease, or cardiovascular disease within a 10-year observation period. The following 2 x 2 cross-tabulation table summarizes the data:
                 Long-Term Complications
Treatment          Yes       No      Total
Treatment 1         12       88        100
Treatment 2          8       92        100
Total               20      180        200

The estimate of the population proportion of all patients who develop complications under treatment 1 ($p_1$) is $\hat{p}_1 = 12/100 = 0.12$, by (7.2). This is equivalent to the estimate of the probability that a single patient develops complications under treatment 1. The estimate of the probability that a single patient develops complications under treatment 2 is $\hat{p}_2 = 8/100 = 0.08$.
The probability of success or outcome of interest (in Example 7.5, the outcome is the development of complications) is often called the risk of outcome. There are a number of statistics used to compare risks of outcomes between populations (or between treatments). These statistics are called effect measures and are described in detail in Chapter 8.
SAS Example 7.5 Generating Cross-Tabulations Using SAS

The following output was generated using SAS Proc Freq, which generates a contingency table (or cross-tabulation) when two variables are specified. A brief interpretation appears after the output.
302
Chapter 7 Categorical Data
SAS Output for Example 7.5

The FREQ Procedure

Table of trt by compl

trt          compl
Frequency
Percent
Row Pct
Col Pct        1yes       2_no      Total
trt_1            12         88        100
               6.00      44.00      50.00
              12.00      88.00
              60.00      48.89
trt_2             8         92        100
               4.00      46.00      50.00
               8.00      92.00
              40.00      51.11
Total            20        180        200
              10.00      90.00     100.00

Sample Size = 200

Interpretation of SAS Output for Example 7.5

SAS generates a contingency table. Each cell of the table displays the Frequency, the Percent, the Row Percent, and the Column Percent (see legend in top left corner of table). The Frequency is the number of subjects in each cell. For example, there are 12 patients in the top left cell (i.e., subjects on treatment 1 who also had complications). The Percent is the percent of all subjects in each cell. These patients reflect 6% of the total sample (12/200 = 0.06). The Row Percent is the percent of subjects in the particular row that fall in that cell. For example, there are 100 patients on treatment 1, or 100 patients in the first row of the contingency table. The 12 patients who had complications reflect 12% of all patients on treatment 1 (12/100 = 0.12). The Column Percent is the percent of subjects in the particular column who fall in that cell. For example, there are 20 patients who report complications. These 20 patients appear in the first column of the contingency table. The 12 patients in treatment 1 who had complications reflect 60% of all patients who had complications (12/20 = 0.60). The row total and column total (called the marginal totals) are displayed to the right and at the bottom of the contingency table, respectively. Both row and column frequencies and percents (of total) are displayed.
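The four numbers reported in each cell follow directly from the counts. As a rough illustration in Python (the text itself uses SAS; the function name `cell_percents` and the dictionary layout are assumptions made for this sketch):

```python
# Counts from Example 7.5: rows are treatments, columns are complications.
counts = {
    "trt_1": {"yes": 12, "no": 88},
    "trt_2": {"yes": 8, "no": 92},
}

def cell_percents(table, row, col):
    """Return (Percent, Row Pct, Col Pct) for one cell of a contingency table."""
    total = sum(sum(r.values()) for r in table.values())
    row_total = sum(table[row].values())
    col_total = sum(r[col] for r in table.values())
    cell = table[row][col]
    return (100 * cell / total, 100 * cell / row_total, 100 * cell / col_total)

percent, row_pct, col_pct = cell_percents(counts, "trt_1", "yes")
print(f"{percent:.2f} {row_pct:.2f} {col_pct:.2f}")  # 6.00 12.00 60.00
```

The same call with `("trt_2", "no")` gives the bottom right cell of the table.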
7.3 Diagnostic Tests: Sensitivity and Specificity

A diagnostic test is a tool used to detect outcomes or events that are not directly observable. For example, an individual may have a condition or disease that is not directly observable by a physician. A diagnostic test designed to detect such a condition can be used as a tool to assist the physician in detection. Desirable properties in diagnostic tests include the following:

• The diagnostic test will indicate an event when the event is present, and
• The diagnostic test will indicate a nonevent when the event is absent.

Example 7.6 Estimating Sensitivity and Specificity
A clinical trial is conducted to evaluate a diagnostic screening test designed to detect fetal chromosomal abnormalities. The diagnostic test is performed on a random sample of 200 pregnant women, who later undergo an amniocentesis. Fetal chromosomal abnormalities are confirmed using amniocentesis. The following 2 x 2 cross-tabulation table summarizes the data:

                            Diagnostic Test
Amniocentesis             Positive    Negative    Total
Abnormal (Disease)            14           6         20
Normal (No Disease)           64         116        180
Total                         78         122        200

Based on amniocentesis, the estimate of the population proportion of all women carrying fetuses with chromosomal abnormalities ($p$) is $\hat{p} = 20/200 = 0.10$, by (7.2).
The following statistics are used to describe diagnostic tests: the sensitivity of the test, the specificity of the test, the predictive value positive ($PV^+$), and the predictive value negative ($PV^-$). These statistics are defined as follows:

Sensitivity = P(Positive test | Disease)
Specificity = P(Negative test | No disease)                    (7.5)
Predictive value positive = P(Disease | Positive test)
Predictive value negative = P(No disease | Negative test)

In Example 7.6, the estimate of the sensitivity of the test is 14/20 = 0.70. The estimate of the specificity of the test is 116/180 = 0.64. The estimate of the predictive value positive is $PV^+$ = 14/78 = 0.18, and the estimate of the predictive value negative is $PV^-$ = 116/122 = 0.95.
In most cases, higher sensitivities and higher specificities are desirable. There are instances, however, where a better test is determined by only one criterion (e.g., higher sensitivity).
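The four estimates in Example 7.6 are simple ratios of cell counts to row or column totals. A minimal Python sketch follows (not from the text; the function name `diagnostic_summary` and the argument names tp, fn, fp, tn for true/false positives and negatives are assumptions):

```python
def diagnostic_summary(tp, fn, fp, tn):
    """Estimate the four statistics in (7.5) from a 2 x 2 table of counts."""
    return {
        "sensitivity": tp / (tp + fn),   # P(Positive test | Disease)
        "specificity": tn / (tn + fp),   # P(Negative test | No disease)
        "pv_positive": tp / (tp + fp),   # P(Disease | Positive test)
        "pv_negative": tn / (tn + fn),   # P(No disease | Negative test)
    }

# Example 7.6: abnormal row is (14, 6), normal row is (64, 116)
stats = diagnostic_summary(tp=14, fn=6, fp=64, tn=116)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and specificity divide by row totals (true disease status), while the predictive values divide by column totals (test result), which is why the low prevalence here (10%) pulls the predictive value positive down to about 0.18 even though sensitivity is 0.70.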
SAS Example 7.6 Estimating Sensitivity and Specificity Using SAS

The following output was generated using SAS Proc Freq, which generates a contingency table (or cross-tabulation) when two variables are specified. SAS does not produce the estimates of sensitivity, specificity, false positive rate, and false negative rate directly, but these can be extracted from the contingency table. The statistics of interest are described after the output.

SAS Output for Example 7.6
The FREQ Procedure

Table of amnio by diagtest

amnio        diagtest
Frequency
Percent
Row Pct
Col Pct      positive   negative      Total
abnormal           14          6         20
                 7.00       3.00      10.00
                70.00      30.00
                17.95       4.92
normal             64        116        180
                32.00      58.00      90.00
                35.56      64.44
                82.05      95.08
Total              78        122        200
                39.00      61.00     100.00

Interpretation of SAS Output for Example 7.6

The sensitivity is the proportion of abnormal cases correctly classified by the test as positive, 14/20 = 0.70. This is the Row Percent in the top left cell of the table. The specificity is the proportion of normal cases that are correctly classified by the test as negative, 116/180 = 0.644. This is the Row Percent of the bottom right cell of the table. The predictive value positive is the proportion of cases classified as positive that are, in fact, diseased, $PV^+$ = 14/78 = 0.18. This is the Column Percent of the top left cell of the table. The predictive value negative is the proportion of cases classified as negative that are, in fact, normal, $PV^-$ = 116/122 = 0.95. This is the Column Percent of the bottom right cell of the table.
7.4 Statistical Inference Concerning ($p_1 - p_2$)

We often compare two independent populations with respect to the proportion of successes in each. A better study design to evaluate the effectiveness of flu shots in pediatric patients (Example 7.4) would involve two comparison groups. One group would receive the flu shots and the other would receive a placebo shot (to maintain blinding; why is this important?). The analysis would then compare the groups with respect to the proportions of children who developed flu. In the two independent samples situation, one parameter of interest is the difference in proportions, the risk difference: ($p_1 - p_2$), where $p_1$ = the proportion of successes in population 1 and $p_2$ = the proportion of successes in population 2.
The point estimate for the risk difference, or difference in proportions, is given by

$$\hat{p}_1 - \hat{p}_2$$

where $\hat{p}_i$ = the sample proportion in population $i$ ($i = 1, 2$).

If independent samples from both populations are sufficiently large (see criteria in Table 7.2), then the confidence interval formula shown in Table 7.2 can be used to estimate ($p_1 - p_2$).
Table B.2A contains the values from the standard normal distribution for commonly used confidence levels. When the sample sizes are adequate (i.e., if and only if $\min(n_1\hat{p}_1, n_1(1-\hat{p}_1)) > 5$ and $\min(n_2\hat{p}_2, n_2(1-\hat{p}_2)) > 5$), the confidence interval formula given in Table 7.2 is appropriate. If either (or both) sample size(s) are not adequate, alternative formulas are available that are based on the binomial distribution and not the normal approximation given here.

Table 7.2 Confidence Interval for ($p_1 - p_2$)

Attributes:
  Simple random samples from binomial populations
  Independent populations
  Large samples: $\min(n_1\hat{p}_1, n_1(1-\hat{p}_1)) > 5$ and $\min(n_2\hat{p}_2, n_2(1-\hat{p}_2)) > 5$

Confidence Interval:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

where $\hat{p}_1 = X_1/n_1$ and $\hat{p}_2 = X_2/n_2$
Example 7.7 Estimating Difference in Proportions of Children Who Use the Emergency Room Between Treatments

We want to evaluate the effectiveness of a new treatment for asthma. The new treatment is administered in an inhaler and will be compared to a standard treatment administered in the same way. Because asthma is a serious condition, it would be unethical to use a placebo comparator in this trial. Suppose our outcome variable is emergency room (ER) use for complications of asthma during a 6-month follow-up period. A random sample of 375 asthmatic children is selected from a registry, of which 250 are randomized to the new treatment group and 125 are randomized to the comparison group (standard treatment). Both groups are provided instruction on the proper use of their inhalers. This allocation scheme is called 2-to-1, where twice as many participants are randomized to the investigational treatment as the control. Both groups of children are followed for 6 months and monitored for ER use. Of the children on the new treatment, 60 used the ER during the 6 months for complications of asthma, and 19 of the children on the standard treatment used the ER for complications of asthma during the same period. Construct a 95% confidence interval for the difference in the proportions of asthmatic children on the new and standard treatments who used the ER during the 6-month follow-up period. The data layout is as follows:
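Asthma Registry
    New treatment:        n = 250    X = 60 use ER
    Standard treatment:   n = 125    X = 19 use ER

The sample proportions are $\hat{p}_1 = 60/250 = 0.24$ and $\hat{p}_2 = 19/125 = 0.15$. First, check whether or not the sample sizes are sufficiently large:

$\min(n_1\hat{p}_1, n_1(1-\hat{p}_1)) = \min(250(0.24), 250(1-0.24)) = \min(60, 190) = 60 > 5$ ✓

and

$\min(n_2\hat{p}_2, n_2(1-\hat{p}_2)) = \min(125(0.15), 125(1-0.15)) = \min(19, 106) = 19 > 5$ ✓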
The formula from Table 7.2 is appropriate:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Substituting the sample data and the appropriate value from Table B.2A for 95% confidence:

$$0.09 \pm 1.960 \sqrt{\frac{0.24(1-0.24)}{250} + \frac{0.15(1-0.15)}{125}}$$
$$0.09 \pm 1.960(0.042)$$
$$0.09 \pm 0.082$$
$$(0.008, 0.172)$$
Thus, we are 95% confident that the true difference in the population proportions of asthmatic children on the new treatment as compared to children on standard treatment who used the ER during a 6-month period is between 0.8% and 17.2%. Based on the confidence interval estimate, can we say that there is a significant difference in the proportions of asthmatic children on the new treatment as compared to the standard treatment who used the ER during a 6-month period? (Hint: Does the confidence interval estimate include 0?) Is the new treatment effective? Notice the direction of the effect.

NOTE: The two-sample confidence interval concerning ($p_1 - p_2$) estimates the difference in proportions, as opposed to the value of either proportion (as was the case in the one-sample applications).
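The Example 7.7 interval can be reproduced with a few lines of code. This is a hedged Python sketch of the Table 7.2 formula, not the text's own method of computation (the text uses hand calculation and SAS); the function name `risk_difference_ci` is hypothetical.

```python
import math

def risk_difference_ci(x1, n1, x2, n2, z=1.96):
    """Large-sample confidence interval for p1 - p2 (form shown in Table 7.2)."""
    p1, p2 = x1 / n1, x2 / n2
    # Sample-size check applied to each sample, as described in the text.
    for n, p in ((n1, p1), (n2, p2)):
        if not min(n * p, n * (1 - p)) > 5:
            raise ValueError("sample too small for the normal approximation")
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se

# Example 7.7: 60 of 250 (new treatment) and 19 of 125 (standard treatment) used the ER
lower, upper = risk_difference_ci(60, 250, 19, 125)
print(f"({lower:.3f}, {upper:.3f})")
```

Because the code keeps $\hat{p}_2 = 19/125 = 0.152$ unrounded, the limits come out at roughly (0.006, 0.170) rather than the (0.008, 0.172) obtained above with $\hat{p}_2$ rounded to 0.15. Either way the interval excludes 0, and the positive sign indicates more ER use in the new treatment group.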
In some applications, it is of interest to compare two populations on the basis of the proportions of successes in each using a formal test of hypothesis.
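Table 7.3 contains the test statistic for tests concerning ($p_1 - p_2$).

Table 7.3 Test Statistic for ($p_1 - p_2$)

Attributes:
  Simple random samples from binomial populations
  Independent populations
  Large samples*

Test Statistic:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $\hat{p} = \dfrac{X_1 + X_2}{n_1 + n_2}$

*$\min(n_1\hat{p}_1, n_1(1-\hat{p}_1)) > 5$ and $\min(n_2\hat{p}_2, n_2(1-\hat{p}_2)) > 5$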
Example 7.8 Testing Difference in Proportions of Patients Who Experience Pain Relief Between Treatments

A new drug is being compared to an existing drug for its effectiveness in relieving headache pain. One hundred subjects who suffer from chronic headaches are randomly assigned to either Group 1: Existing Drug, or Group 2: New Drug. Subjects do not know which drug they are taking in this experiment. Subjects are provided with a single dose of the assigned drug and instructed to take the full dose as soon as they experience headache pain and to record whether or not they experience relief from headache pain within 60 minutes. Among the 50 subjects assigned to Group 1: Existing Drug, 28 reported relief from headache pain within 60 minutes. Among the 50 subjects assigned to Group 2: New Drug, 34 reported relief from headache pain within 60 minutes. Based on the data, is the proportion of subjects reporting relief from headache pain within 60 minutes under the New Drug significantly different from the proportion of subjects reporting relief within 60 minutes under the Existing Drug? Use a 5% level of significance. The data layout is as follows:
Group 1: Existing Drug        Group 2: New Drug
n = 50                        n = 50
X = 28 (relief)               X = 34 (relief)

1. Set up hypotheses.

$H_0$: $p_1 = p_2$
$H_1$: $p_1 \ne p_2$, $\alpha = 0.05$

where $p_1$ = the proportion of patients who experience relief from headache pain using the Existing Drug
      $p_2$ = the proportion of patients who experience relief from headache pain using the New Drug

2. Select the appropriate test statistic.

The sample proportions are

$$\hat{p}_1 = \frac{28}{50} = 0.56, \qquad \hat{p}_2 = \frac{34}{50} = 0.68$$
First, check whether or not the sample sizes are sufficiently large:

$\min(n_1\hat{p}_1, n_1(1-\hat{p}_1)) = \min(50(0.56), 50(1-0.56)) = \min(28, 22) = 22 > 5$ ✓

and

$\min(n_2\hat{p}_2, n_2(1-\hat{p}_2)) = \min(50(0.68), 50(1-0.68)) = \min(34, 16) = 16 > 5$ ✓

The appropriate test statistic is given in Table 7.3:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

3. Decision rule.

Reject $H_0$ if $Z \le -1.960$ or if $Z \ge 1.960$
Do not reject $H_0$ if $-1.960 < Z < 1.960$
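To illustrate how the decision rule would be applied to the Example 7.8 data, here is a minimal Python sketch. It assumes the usual pooled form of the two-sample statistic, with $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$; the function name `two_proportion_z` is hypothetical, and the result should be read as a check under that assumption rather than as the text's printed value.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for H0: p1 = p2 using a pooled estimate of the common proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Example 7.8: 28 of 50 (Existing Drug) and 34 of 50 (New Drug) reported relief
z = two_proportion_z(28, 50, 34, 50)
reject = abs(z) >= 1.960  # two-sided rule at the 5% level
print(f"Z = {z:.2f}, reject H0: {reject}")
```

Under these assumptions Z is about −1.24, which lies between −1.960 and 1.960, so the rule above would not reject $H_0$ for these data.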
Reject $H_0$ if $\chi^2 \ge 5.99$
Do not reject $H_0$ if $\chi^2 < 5.99$
4. Test statistic.

To organize the computations of the test statistic (7.7), the following table is used:

Time Slot:                 Mondays          Thursdays        Saturdays        Total
                           6:00-7:30 PM     4:00-5:30 PM     8:00-9:30 AM
O = observed frequency:    47               32               21               100
E = expected frequency:    33.3             33.3             33.3             100
(O − E):                   13.7             −1.3             −12.3
(O − E)²/E:                (13.7)²/33.3     (−1.3)²/33.3     (−12.3)²/33.3
                           = 5.64           = 0.05           = 4.54           10.23

NOTE: The sum of the expected frequencies is equal to the sum of the observed frequencies (n = 100).

The test statistic is $\chi^2 = 10.23$.
7.5 Chi-Square Tests
5. Conclusion.

Reject $H_0$ since 10.23 > 5.99. We have significant evidence, $\alpha = 0.05$, to show that the three time slots are not equally popular or convenient for the patients. In fact, almost half (47%) of the patients selected the Monday 6:00-7:30 PM slot. For this example, p < 0.01 (see Table B.5).

SAS Example 7.9 Goodness of Fit Test for Patient Preferences Using SAS
The following output was generated using SAS Proc Freq with an option to run a goodness-of-fit test. Specifically, the user specifies the distribution of responses under the null hypothesis (see Section 7.9 for the SAS code).

SAS Output for Example 7.9

The FREQ Procedure

                                  Test     Cumulative    Cumulative
day      Frequency    Percent    Percent    Frequency       Percent
Mon             47      47.00      33.00           47         47.00
Sat             21      21.00      33.00           68         68.00
Thurs           32      32.00      33.00          100        100.00

Chi-Square Test for Specified Proportions
Chi-Square       10.3333
DF                     2
Pr > ChiSq        0.0057

Sample Size = 100

Interpretation of SAS Output for Example 7.9

SAS first generates a frequency distribution table and provides the number and percent of respondents in each response category. SAS then lists the Test Percent in each category (these are supplied by the user and reflect the expected proportions). The last two columns contain the cumulative frequencies and cumulative percents for the sample data. SAS then produces the chi-square statistic for the goodness-of-fit test along with degrees of freedom (df = k − 1) and a p value. Here $\chi^2 = 10.33$ and p = 0.0057. We therefore reject $H_0$ because p = 0.0057 < 0.05 and conclude that the three time slots are not equally popular or convenient for the patients.
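The hand calculation gave $\chi^2 = 10.23$ while the SAS output reports 10.3333. The short Python sketch below (not from the text; the function name `chi_square_gof` is hypothetical) suggests the gap comes from the expected counts: the SAS value appears consistent with an expected count of 33 per slot, matching the Test Percent of 33.00 in the output, whereas the hand calculation used 33.3.

```python
import math

def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [47, 32, 21]          # Mondays, Thursdays, Saturdays
n = sum(observed)

chi2_thirds = chi_square_gof(observed, [n / 3] * 3)   # exact thirds, E = 33.33...
chi2_sas = chi_square_gof(observed, [33.0] * 3)       # E = 33, as the Test Percent implies

# For df = 2 only, the chi-square upper-tail probability equals exp(-x / 2).
p_sas = math.exp(-chi2_sas / 2)

print(f"{chi2_thirds:.2f} {chi2_sas:.4f} {p_sas:.4f}")
```

With exact thirds the statistic is 10.22; rounding E to 33.3 gives the 10.23 of the worked example, and E = 33 gives 10.3333 with a p value of about 0.0057. All three lead to the same conclusion against the critical value of 5.99.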
Example 7.10 Goodness of Fit Test for Teen Issues

Volunteers at a teen hotline have been assigned based on the assumption that 40% of all calls are drug related, 25% are sex related (e.g., date rape), 25% are stress related, and 10% concern educational issues. For this investigation, each call is classified into one category based on the primary issue raised by the caller. To test the hypothesis, the following data are collected from 120 randomly selected calls placed to the teen hotline. Based on the data, is the assumption regarding the distribution of topic issues appropriate?

Topic Issue:         Drugs     Sex     Stress     Education
Number of calls:        52      38         21             9

1. Set up hypotheses.

$H_0$: $p_1 = 0.40$, $p_2 = 0.25$, $p_3 = 0.25$, $p_4 = 0.10$
$H_1$: $H_0$ is false

or

$H_0$: Distribution across categories is 0.40, 0.25, 0.25, 0.10
$H_1$: $H_0$ is false, $\alpha = 0.05$

2. Select the appropriate test statistic.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $\sum$ indicates summation over the k response categories
      O = observed frequencies
      E = expected frequencies (i.e., if $H_0$ is true, or under $H_0$)

3. Decision rule.

In order to select the appropriate critical value, we first determine the degrees of freedom.

df = k − 1 = 4 − 1 = 3

The appropriate critical value of $\chi^2$ is $\chi^2 = 7.815$ from Table B.5. The decision rule is

Reject $H_0$ if $\chi^2 \ge 7.815$
Do not reject $H_0$ if $\chi^2 < 7.815$
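4. Test statistic.

To organize the computations of the test statistic, the following table is used:

Topic Issue:               Drugs         Sex           Stress        Education      TOTAL
O = observed frequency:    52            38            21            9              120
E = expected frequency:    48            30            30            12             120
(O − E):                   4             8             −9            −3
(O − E)²/E:                (4)²/48       (8)²/30       (−9)²/30      (−3)²/12
                           = 0.33        = 2.13        = 2.70        = 0.75

NOTE: The sum of the expected frequencies is equal to the sum of the observed frequencies (n = 120).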
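The computations in the table for Example 7.10 can be sketched in a few lines of Python. This is an illustration only (variable names are assumptions, and the text itself works by hand and in SAS): expected frequencies are the hypothesized proportions times n, and the statistic is compared with the critical value 7.815 quoted from Table B.5.

```python
observed = [52, 38, 21, 9]                 # Drugs, Sex, Stress, Education
null_props = [0.40, 0.25, 0.25, 0.10]      # distribution stated in H0

n = sum(observed)
expected = [n * p for p in null_props]     # approximately 48, 30, 30, 12

# Sum (O - E)^2 / E over the k = 4 categories.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical = 7.815                           # df = 3, alpha = 0.05, from Table B.5
print(f"chi-square = {chi2:.2f}; reject H0: {chi2 >= critical}")
```

With these inputs the statistic is about 5.92, below 7.815, so under the stated decision rule $H_0$ would not be rejected.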
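Confidence interval estimate for $p$:
    $\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\hat{p}(1-\hat{p})/n}$
    See Table 7.1 for necessary conditions (find Z in Table B.2A)

Test $H_0$: $p = p_0$:
    $Z = (\hat{p} - p_0) / \sqrt{p_0(1-p_0)/n}$
    See Table 7.1 for necessary conditions

Confidence interval estimate for ($p_1 - p_2$):
    $(\hat{p}_1 - \hat{p}_2) \pm Z_{1-(\alpha/2)} \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$

Test $H_0$: $p_1 = p_2$:
    $Z = (\hat{p}_1 - \hat{p}_2) / \sqrt{\hat{p}(1-\hat{p})(1/n_1 + 1/n_2)}$
    See Table 7.3 for necessary conditions and definitions of components of Z

Test $H_0$: distribution of responses follows specified pattern:
    $\chi^2 = \sum (O - E)^2/E$, df = k − 1
    Chi-square goodness-of-fit test

Test $H_0$: two variables are independent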
Find n to estimate
,
£