Introductory applied biostatistics [1 ed.] 9780495014713, 0495014710, 9780534423995, 053442399X

INTRODUCTORY APPLIED BIOSTATISTICS (WITH CD-ROM) explores statistical applications in the medical and public health fiel


English Pages 652 [678] Year 2006

Ralph B. D'Agostino, Sr.
Lisa M. Sullivan
Alexa S. Beiser

Introductory Applied Biostatistics

Digitized by the Internet Archive in 2012
http://www.archive.org/details/introductoryapplOOdago

Ralph D'Agostino Sr. is Professor of Mathematics, Statistics, and Public Health at Boston University. A respected and widely published statistician, he has more than 30 years of experience in running clinical trials and epidemiological research. He is a Senior Editor of STATISTICS IN MEDICINE, Fellow of the American Statistical Association and Epidemiologic Section of the American Heart Association. He was Chair (2003) of the American Statistical Association Section of Statistics in Epidemiology. He serves as Executive Director of Biometrics and Data Management for the Harvard Clinical Research Institute. His interests are in biostatistics and robust procedures, longitudinal data analysis, and multivariate data. Dr. D'Agostino's numerous awards include the Food & Drug Administration Commissioner's Special Citation in 1981 and 1995. He is Director of Data Management and Statistical Analysis for the Framingham Heart Study that for more than 50 years has searched for common factors that contribute to cardiovascular disease. He is co-author of five books in various fields of statistical methodology.

Lisa Sullivan is an Associate Professor of Biostatistics at the School of Public Health, Associate Professor of Mathematics and Statistics at Boston University, and Assistant Dean for Undergraduate Education in Public Health at Boston University, where she received her MA and her PhD. She has won numerous awards for excellence in teaching. Her research interests include applied biostatistics, longitudinal data analysis, design and analysis of clinical trials, and hierarchical modeling. Dr. Sullivan spends most of her time in the Boston University Statistics and Consulting Unit working on the Framingham Heart Study. Her recent research focuses on developing health risk appraisal functions to quantify individuals' risks of developing cardiovascular disease. Her dozens of articles are published in prestigious periodicals such as the JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, NEW ENGLAND JOURNAL OF MEDICINE, and STATISTICS IN MEDICINE. Away from work, she enjoys running and cooking.

Alexa Beiser is Professor of Biostatistics in the School of Public Health at Boston University. She received her MA from University of California at San Diego and her PhD from Boston University. Her research interests include clinical trials methodology, statistical computing, and survival analysis. Dr. Beiser joined the Framingham Study in 1994 after spending many years collaborating on pediatric research projects. She primarily investigates risk factors for stroke, dementia, and Alzheimer's disease using Framingham Study data. Her foremost methodological interest is in estimation of lifetime risk of disease. Dr. Beiser has published articles in the NEW ENGLAND JOURNAL OF MEDICINE, the JOURNAL OF THE AMERICAN MEDICAL ASSOCIATION, STATISTICS IN MEDICINE, STROKE, and NEUROLOGY. She enjoys reading, traveling, and spending time with her four children.

Introductory Applied Biostatistics

Ralph B. D'Agostino, Sr.
Boston University

Lisa M. Sullivan
Boston University

Alexa S. Beiser
Boston University

THOMSON BROOKS/COLE
Australia • Canada • Mexico • Singapore • Spain • United Kingdom • United States

THOMSON BROOKS/COLE

Introductory Applied Biostatistics
Ralph B. D'Agostino, Sr., Lisa M. Sullivan, Alexa S. Beiser

Editor: Carolyn Crockett
Assistant Editor: Ann Day
Editorial Assistant: Daniel Geller
Technology Project Manager: Burke Taft
Marketing Manager: Stacy Best
Marketing Assistant: Jessica Bothwell
Executive Marketing Communications Manager: Nathaniel Bergson-Michelson
Project Manager, Editorial Production: Kelsey McGee
Art Director: Lee Friedman
Print Buyer: Lisa Claudeanos
Permissions Editor: Sarah Harkrader
Production Service: Matrix Productions/Merrill Peterson
Text Designer: Carolyn Deacy
Copy Editor: Pamela Rockwell
Cover Designer: Simple Design/Denise Davidson
Cover Image: PhotoDisc® Getty Images™
Compositor: Interactive Composition Corporation
Cover Printer: Phoenix Color Corp
Cover Printing, Printing, and Binding: R.R. Donnelley/Crawfordsville

© 2006 Duxbury, an imprint of Thomson Brooks/Cole, a part of The Thomson Corporation. Thomson, the Star logo, and Brooks/Cole are trademarks used herein under license.

ALL RIGHTS RESERVED. No part of this work covered by the copyright hereon may be reproduced or used in any form or by any means (graphic, electronic, or mechanical, including photocopying, recording, taping, Web distribution, information storage and retrieval systems, or in any other manner) without the written permission of the publisher.

Printed in the United States of America
3 4 5 6 7 09 08 07 06

For more information about our products, contact us at:
Thomson Learning Academic Resource Center
1-800-423-0563

For permission to use material from this text or product, submit a request online at http://www.thomsonrights.com. Any additional questions about permissions can be submitted by email to [email protected].

Library of Congress Control Number: 2004114720
ISBN 0-534-42399-X

Thomson Higher Education
10 Davis Drive
Belmont, CA 94002-3098
USA

Asia (including India)
Thomson Learning
5 Shenton Way #01-01
UIC Building
Singapore 068808

Australia/New Zealand
Thomson Learning Australia
102 Dodds Street
Southbank, Victoria 3006
Australia

Canada
Thomson Nelson
1120 Birchmount Road
Toronto, Ontario M1K 5G4
Canada

UK/Europe/Middle East/Africa
Thomson Learning
High Holborn House
50-51 Bedford Row
London WC1R 4LR
United Kingdom

Latin America
Thomson Learning
Seneca, 53
Colonia Polanco
11560 Mexico D.F.
Mexico

Spain (including Portugal)
Thomson Paraninfo
Calle Magallanes, 25
28015 Madrid, Spain

Brief Contents

CHAPTER 1   Motivation
CHAPTER 2   Summarizing Data
CHAPTER 3   Probability
CHAPTER 4   Sampling Distributions
CHAPTER 5   Statistical Inference: Procedures for μ
CHAPTER 6   Statistical Inference: Procedures for (μ1 - μ2)
CHAPTER 7   Categorical Data
CHAPTER 8   Comparing Risks in Two Populations
CHAPTER 9   Analysis of Variance
CHAPTER 10  Correlation and Regression
CHAPTER 11  Logistic Regression Analysis
CHAPTER 12  Nonparametric Tests
CHAPTER 13  Introduction to Survival Analysis

Contents

Preface

CHAPTER 1  Motivation
  1.1  Vocabulary
  1.2  Population Parameters
  1.3  Sampling and Sample Statistics
  1.4  Statistical Inference

CHAPTER 2  Summarizing Data
  2.1  Background
       2.1.1  Vocabulary
       2.1.2  Classification of Variables
       2.1.3  Notation
  2.2  Descriptive Statistics and Graphical Methods
       2.2.1  Numerical Summaries for Continuous Variables
       2.2.2  Graphical Summaries for Continuous Variables
       2.2.3  Numerical Summaries for Discrete Variables
       2.2.4  Graphical Summaries for Discrete Variables
  2.3  Key Formulas
       2.3.1  Graphical Methods: Continuous Variables
       2.3.2  Graphical Methods: Discrete Variables
  2.4  Statistical Computing
       2.4.1  Continuous Variables
       2.4.2  Discrete Variables
       2.4.3  Summary of SAS Procedures
  2.5  Analysis of Framingham Heart Study Data
  2.6  Problems

CHAPTER 3  Probability
  3.1  Background
       3.1.1  Vocabulary
  3.2  First Principles
  3.3  Combinations and Permutations
  3.4  The Binomial Distribution
  3.5  The Normal Distribution
       3.5.1  Percentiles of the Normal Distribution
       3.5.2  Normal Approximation to the Binomial
  3.6  Key Formulas
  3.7  Applications Using SAS
       3.7.1  Summary of SAS Functions
  3.8  Analysis of Framingham Heart Study Data
  3.9  Problems

CHAPTER 4  Sampling Distributions
  4.1  Background
  4.2  The Central Limit Theorem
  4.3  Key Formulas
  4.4  Applications Using SAS
       4.4.1  Summary of SAS Functions
  4.5  Problems

CHAPTER 5  Statistical Inference: Procedures for μ
  5.1  Estimating μ
       5.1.1  Vocabulary and Notation
       5.1.2  Confidence Intervals for μ
       5.1.3  Precision and Sample Size Determination

50 years of age with diagnosed coronary artery disease. Had the patients been sampled from the population of patients 50 years of age free of cardiovascular disease, these observed systolic blood pressures might be slightly higher than expected.

If the sample included not 7 patients, but 700 instead, it would be impossible to summarize the sample with respect to systolic blood pressure by inspecting the values. In fact, the same would probably be true if the sample included 20 subjects. In most applications, it is necessary to use statistical techniques to summarize a sample. Several of these are described next. To simplify the computations, each data element (i.e., each observed systolic blood pressure) is represented by the variable X. Here X denotes systolic blood pressure, and the subscripts (i = 1, 2, ..., 7) denote the subject number in the sample:

$X_1 = 121,\ X_2 = 110,\ X_3 = 114,\ X_4 = 100,\ X_5 = 160,\ X_6 = 130,\ X_7 = 130$

It is generally of interest to summarize a continuous variable with respect to location. Location refers to the "center" of the data set and addresses the question, What is a typical systolic blood pressure? In the computations that follow, we can drop the subscripts, because the subject number (i.e., subscript) has no impact on them. To organize our calculations, we arrange the data elements in a column, as shown. Notice that the data elements are ordered from smallest to largest (this is not necessary but is sometimes convenient) and that the data element 130 is listed twice, as subjects 6 and 7 each have systolic blood pressures of 130.

    X
  100
  110
  114
  121
  130
  130
  160

The first descriptive statistic we consider is the sample mean, denoted $\bar{X}$ ("X bar"). The sample mean is one statistic that summarizes the average value of a sample; it gives a sense of what a typical value looks like. To compute the sample mean, we sum all of the observations and divide by the sample size. The sample mean of the systolic blood pressures is

$$\bar{X} = (100 + 110 + 114 + 121 + 130 + 130 + 160)/7$$

In mathematics, the symbol $\Sigma$ (uppercase "sigma") denotes summation. The sample mean of the systolic blood pressures can be represented as follows:

$$\bar{X} = \sum X_i / 7, \quad \text{where} \quad \sum X_i = 100 + 110 + 114 + 121 + 130 + 130 + 160$$

In general, the sample mean is denoted:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\sum X}{n} \qquad (2.1)$$

NOTE: In the final expression of the formula, the subscript is suppressed and the summation is understood to be over all subjects in the sample.

The mean systolic blood pressure is $\bar{X} = 865/7 = 123.6$. Reviewing the data elements, we see that some of the observed systolic blood pressures are above the mean of 123.6, and others are below the mean. The mean of 123.6 is interpreted as the average, or typical, systolic blood pressure in the sample. In journal articles and research reports, readers are generally not shown the actual data elements. Instead, summary statistics such as the sample size and sample mean are provided.

The sample mean is referred to as the balancing point, or pivot point, of the sample since the sum of the distances between observations below the mean and the sample mean is equal to the sum of the distances between observations above the mean and the sample mean (see dotplot in Figure 2.2). These distances or "deviations from the mean" are denoted $(X - \bar{X})$. The following table displays the data elements along with their respective deviations from the mean (i.e., distance from $\bar{X} = 123.6$):

      X     (X - X̄)
    100      -23.6
    110      -13.6
    114       -9.6
    121       -2.6
    130        6.4
    130        6.4
    160       36.4
    ---      -----
    865       -0.2*

* This sum is theoretically zero; the difference here is due to rounding.
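The arithmetic above can be checked with a few lines of code. The following is a minimal sketch in Python (used here only for illustration; the text itself relies on SAS), and the variable names are assumptions of this sketch rather than anything defined in the text. It computes the sample mean for Example 2.1 and shows why the tabulated deviations sum to -0.2 rather than 0: the table uses the mean rounded to one decimal place.

```python
# Systolic blood pressures from Example 2.1, listed by subject number 1..7.
sbp = [121, 110, 114, 100, 160, 130, 130]

n = len(sbp)
total = sum(sbp)
mean = total / n  # unrounded sample mean

# Deviations from the unrounded mean sum to zero (up to floating-point error).
exact_deviations = [x - mean for x in sbp]

# The table in the text uses the mean rounded to one decimal place (123.6),
# which is why its deviations sum to -0.2 instead of 0.
rounded_mean = round(mean, 1)
table_deviations = [round(x - rounded_mean, 1) for x in sorted(sbp)]

print(f"n = {n}, sum = {total}, mean = {mean:.1f}")
print("deviations using rounded mean:", table_deviations)
print("their sum:", round(sum(table_deviations), 1))
print("sum of exact deviations:", round(sum(exact_deviations), 10))
```

Carrying the unrounded mean through the calculation and rounding only the final result avoids the small discrepancy noted in the table's footnote.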

Figure 2.1 Sample Mean as Balancing Point (dotplot of the seven systolic blood pressures on an axis from 100 to 160, with the sample mean $\bar{X} = 123.6$ marked)

The sample mean measures the location, or central tendency, of the sample. Location is very important in interpreting sample data. However, two very different samples might produce the same sample mean. Consider a second sample of 7 subjects from the population of patients 50 years of age with diagnosed coronary artery disease. Again we measure systolic blood pressures, in millimeters of mercury (mmHg), on each subject. The sample data are

120   121   122   124   125   126   127

The sample mean for this sample is $\bar{X} = (120 + 121 + 122 + 124 + 125 + 126 + 127)/7 = 865/7 = 123.6$. The sizes and the means are the same in the two samples, yet the samples are quite different. For a more complete understanding of the data, we also need a measure of the dispersion, or variability, in the sample. Measures of dispersion address whether the data elements are tightly clustered together or widely spread. Specifically, we are interested in whether the data elements are tightly clustered about the mean or whether the data elements are widely spread above and below the mean. The goal is to generate an estimate of the dispersion in the sample, in particular the dispersion of the data elements about the sample mean. The deviations from the mean sum to zero, since the negative deviations "cancel out" the positive deviations (Figure 2.2). Of real interest is the magnitude of these deviations. Several techniques can be employed to summarize the magnitude of the deviations from the mean. One method is the mean absolute deviation (MAD), which is simply the mean of the absolute values of the deviations from the mean:

$$\text{MAD} = \frac{\sum |X - \bar{X}|}{n}$$

This technique is not generally used for mathematical reasons that are beyond the scope of this book. The more popular statistic, which proves to be the most straightforward mathematically, is based on squared deviations from the mean and is called the sample variance, defined as:

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1} \qquad (2.2)$$

NOTE: The denominator in the sample variance is (n - 1), not n as was the case with the sample mean.

The following table organizes data for the computation of the sample variance. It displays the data elements, deviations from the mean, and squared deviations from the mean, respectively.

      X     (X - X̄)     (X - X̄)²
    100      -23.6        556.96
    110      -13.6        184.96
    114       -9.6         92.16
    121       -2.6          6.76
    130        6.4         40.96
    130        6.4         40.96
    160       36.4      1,324.96
    ---      -----      --------
    865       -0.2*     2,247.72

* The sum is not exactly zero due to rounding.

For Example 2.1, the sample variance is

$$s^2 = \frac{2{,}247.72}{6} = 374.6$$

The sample variance is interpreted as the average squared deviation from the mean. Therefore, on average, systolic blood pressures in our sample are 374.6 squared units from the sample mean of 123.6. This information is important; however, in its present form it does not exactly achieve our original goal, which was to compute a measure of the typical deviation from the mean in the sample. Recall that we summed the square of each deviation from the mean since their sum was zero. Because of this step, the sample variance does not address our original objective directly. To return to our original units, we compute what is called the sample standard deviation, denoted s, defined as the square root of the sample variance.

$$s = \sqrt{s^2} \qquad (2.3)$$

The sample standard deviation of the systolic blood pressures is

$$s = \sqrt{374.6} = 19.4$$

After taking the square root, we have a statistic which can be interpreted as the typical deviation from the mean. In this sample, systolic blood pressures are about 19.4 units from the sample mean. It is often difficult to interpret the value of a standard deviation (e.g., is 19.4 large, small, or appropriate?). Standard deviation, however, is very useful for comparing samples. Recall the second sample of n = 7 subjects we selected from the same population. The second sample had the same size (n = 7) and the same mean ($\bar{X} = 123.6$); however, the standard deviation for the second sample is s = 2.6. The standard deviation in the second sample is much smaller because all of the observations are tightly clustered around the sample mean of 123.6. There is much more variability in the systolic blood pressures measured among patients with diagnosed coronary artery disease in the first sample (sample 1: s = 19.4) as compared to the second sample (s = 2.6).

An alternative formula is available for computing the sample variance that is mathematically equivalent to the formulation provided in (2.2). This alternative formula is called the computational formula for the sample variance (Eq. 2.4). The formula provided in (2.2) is called the definitional formula for the sample variance.

$$s^2 = \frac{\sum X^2 - (\sum X)^2 / n}{n - 1} \qquad (2.4)$$

where $\sum X^2$ = the sum of the squared observations, and $(\sum X)^2$ = the square of the sum of the observations.

The computational formula (2.4) can be easier to work with than the definitional formula given in (2.2), because the components in the computational formula are in most cases easier to compute (i.e., $\sum X^2$ and $(\sum X)^2$). We will now illustrate the use of the computational formula for the sample variance using data from Example 2.1. The following table displays each data element, along with each data element squared.

      X         X²
    100     10,000
    110     12,100
    114     12,996
    121     14,641
    130     16,900
    130     16,900
    160     25,600
    ---    -------
    865    109,137

Using the computational formula (2.4):

$$s^2 = \frac{109{,}137 - (865)^2/7}{7 - 1} = \frac{109{,}137 - 106{,}889.3}{6} = \frac{2247.7}{6} = 374.6$$

As noted, the computations can be somewhat easier with the computational formula as compared to the definitional formula. To implement the computational formula, we need only to compute the sum of the data elements and the sum of the squared data elements, as opposed to deviations and deviations squared for each data element. The reduced number of calculations with the computational formula reduces the chance of error.
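The equivalence of the definitional formula (2.2) and the computational formula (2.4) can be illustrated numerically. The following is a minimal Python sketch, not part of the original text; the function names are hypothetical and chosen only for this illustration. It applies both formulas to the two samples of systolic blood pressures discussed above.

```python
import math

def variance_definitional(data):
    """Sample variance via (2.2): sum of squared deviations over (n - 1)."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

def variance_computational(data):
    """Sample variance via (2.4): (sum of X^2 - (sum of X)^2 / n) over (n - 1)."""
    n = len(data)
    sum_x = sum(data)
    sum_x2 = sum(x * x for x in data)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

sample_1 = [100, 110, 114, 121, 130, 130, 160]  # Example 2.1
sample_2 = [120, 121, 122, 124, 125, 126, 127]  # second sample, same mean

for name, data in (("sample 1", sample_1), ("sample 2", sample_2)):
    s2_def = variance_definitional(data)
    s2_comp = variance_computational(data)
    s = math.sqrt(s2_def)  # standard deviation, formula (2.3)
    print(f"{name}: s^2 (2.2) = {s2_def:.1f}, s^2 (2.4) = {s2_comp:.1f}, s = {s:.1f}")
```

Both formulas give the same variance for each sample. Because the code carries the unrounded mean, the numerator for sample 1 is 2247.71 rather than the 2,247.72 obtained in the table from the rounded mean; the variance still rounds to 374.6.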

A standard data summary for a continuous variable in a sample consists of three statistics:

  sample size (n)
  sample mean ($\bar{X}$)
  sample standard deviation (s)

These three statistics provide information on the number of subjects in the sample, the location, and the dispersion of the sample, respectively. We purposely chose a small data set to illustrate these statistics. It is easy to imagine applications with much larger sample sizes in which it would be impossible to view the entire sample. In such cases, the sample size, mean, and standard deviation provide a very informative and useful summary. Publications and reports almost always include these statistics. As a general guideline, descriptive statistics should include no more than one decimal place beyond that observed in the original data elements. For example, the systolic blood pressures are recorded as whole numbers. Therefore, descriptive statistics are presented to the nearest tenths place (i.e., one decimal place).

The standard summary for Example 2.1 is n = 7, $\bar{X} = 123.6$, and s = 19.4.

A number of other descriptive statistics beyond what we have called the standard summary statistics (i.e., n, $\bar{X}$, and s) are also widely used for continuous variables, including the median and quartiles.

The sample median is defined as the middle value. It is the value that has as many values above it as below it. The median is computed by arranging the data elements from smallest to largest and successively counting from the right and left to arrive at the median, or middle, value. For example, if we arrange our 7 data elements from smallest to largest and count in from the right and left simultaneously, we arrive at the median or middle value after three steps (parentheses mark values that have been counted off):

  Step 1:   (100)   110    114    121    130    130   (160)
  Step 2:   (100)  (110)   114    121    130   (130)  (160)
  Step 3:   (100)  (110)  (114)   121   (130)  (130)  (160)
                                 Median

Because the number of observations in this sample is odd (n = 7), this procedure produces a single number, the median value, upon successively counting in from the right and left. In Example 2.2, we will illustrate the same procedure with an even number of data elements. The interpretation of the median in Example 2.1 is as follows: Half (50%) of the systolic blood pressures are greater than 121 and half (50%) are less than 121.

Both the mean and the median are statistics that measure the average or typical value of a particular characteristic. The median is particularly useful when there are extreme values (either very small or very large as compared to other values) in the sample. Suppose in Example 2.1 that the maximum value was not 160 but 260 instead. The sample mean would be $\bar{X} = (100 + 110 + 114 + 121 + 130 + 130 + 260)/7 = 137.9$, which does not look like a typical value (since 6 of 7 observations are below it). The sample mean is affected by extreme values. In this case, the value 260 inflates the mean value, and it is therefore no longer representative of a typical value. A better measure of location in this situation is the median, which is still 121 and is more representative of a typical systolic blood pressure in this sample. In the absence of extreme values, the sample mean and the sample median will be close in value and the sample mean is considered a better measure of location since all observations contribute to the sample mean. When the sample mean and the sample median are very different, it suggests that extreme values are affecting the mean and that the median might be a more appropriate measure of location. As we work through more examples, it will become clear which measure of location (the mean or the median) is more appropriate in specific applications.
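The sensitivity of the mean, and the stability of the median, under a single extreme value can be seen directly. The following minimal sketch uses Python's standard `statistics` module purely for illustration (the text itself uses SAS); the variable names are assumptions of this sketch.

```python
import statistics

# Example 2.1 as observed, and with the maximum changed from 160 to 260.
original = [100, 110, 114, 121, 130, 130, 160]
with_extreme = [100, 110, 114, 121, 130, 130, 260]

for label, data in (("original", original), ("with 260", with_extreme)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    below = sum(1 for x in data if x < mean)  # observations below the mean
    print(f"{label}: mean = {mean:.1f}, median = {median}, "
          f"{below} of {len(data)} observations below the mean")
```

Changing one observation moves the mean from 123.6 to 137.9 while the median stays at 121, which is the behavior described in the paragraph above.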

Because the sample size in Example 2.1 is small (n = 7), the method of successively counting into the middle of the ordered data set to locate the median is easy to implement. When the sample size is larger, a more efficient method for computing the median involves two steps. In the first step, we compute the position of the median in the ordered data set, and in the second step, we locate the median value. When the number of observations is odd, the position of the median is computed as follows:

$$\frac{n + 1}{2} \qquad (2.5)$$

For Example 2.1, the median is in the fourth position ((7 + 1)/2 = 4) in the ordered data set and is equal to 121. The median represents the middle value; to further describe the sample, we now analyze the top and bottom halves.

The first and third quartiles are the values that separate, respectively, the bottom and top 25% of the data elements. The first quartile of the sample, denoted $Q_1$, is the sample value that holds approximately 25% of the data elements at or below it and approximately 75% above or equal to it. The third quartile, denoted $Q_3$, holds approximately 25% of the data elements at or above it and approximately 75% below or equal to it. The median is also referred to as the second quartile, $Q_2$. The best way to determine the quartiles is to follow the two-step procedure outlined for determining the median (i.e., first compute the positions of the quartiles in the ordered data set, then locate the values). When the number of observations in a sample is odd, the positions of the quartiles are determined by the following formula:

$$\left[ \frac{n + 3}{4} \right] \qquad (2.6)$$

where [k] is the greatest integer less than or equal to k; for example, [2.1] = 2, [5.0] = 5, [10.8] = 10, [2.9] = 2, and so on.

For Example 2.1,

$$\left[ \frac{n + 3}{4} \right] = \left[ \frac{7 + 3}{4} \right] = \left[ \frac{10}{4} \right] = [2.5] = 2$$

In Example 2.1, the quartiles are in the second positions from the top and bottom of the ordered data set. The first quartile is $Q_1 = 110$ and the third quartile is $Q_3 = 130$. Approximately 25% of the systolic blood pressures are 110 or lower and approximately 25% of the systolic blood pressures are 130 or higher. Again, because the sample in Example 2.1 is so small (n = 7), we do not need all of the statistics described to summarize and interpret these data. In larger samples, the quartiles are very informative statistics for understanding the distribution of a particular characteristic.
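The two-step procedure (compute the position, then locate the value) translates directly into code. The following is a minimal Python sketch under the assumption, stated in the text, that the number of observations is odd; the function names are hypothetical and used only for this illustration.

```python
def median_position(n):
    """Position of the median in the ordered data, formula (2.5); odd n only."""
    if n % 2 == 0:
        raise ValueError("formula (2.5) as given applies to an odd number of observations")
    return (n + 1) // 2

def quartile_position(n):
    """Quartile position counted from each end, formula (2.6): [(n + 3) / 4]."""
    return (n + 3) // 4  # integer division gives the greatest integer <= (n + 3) / 4

def median_and_quartiles(data):
    """Return (Q1, median, Q3) for an odd-sized sample using positions (2.5) and (2.6)."""
    ordered = sorted(data)
    n = len(ordered)
    k = quartile_position(n)
    q1 = ordered[k - 1]                        # k-th value from the bottom
    q3 = ordered[n - k]                        # k-th value from the top
    median = ordered[median_position(n) - 1]   # positions are 1-based
    return q1, median, q3

sbp = [121, 110, 114, 100, 160, 130, 130]  # Example 2.1
print("median position:", median_position(len(sbp)))
print("quartile position:", quartile_position(len(sbp)))
print("Q1, median, Q3:", median_and_quartiles(sbp))
```

For Example 2.1 this gives positions 4 and 2 and the values $Q_1 = 110$, median = 121, and $Q_3 = 130$, matching the hand calculation.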

The mode of the data set is defined as the most frequent value. In Example 2.1, the mode is 130, since it appears twice and the remaining values appear only once. A sample can have one mode or several modes. A sample with no repeated values has no mode.

Other very informative descriptive statistics include the minimum and maximum values. In Example 2.1, the minimum is 100 and the maximum is 160. These values can be very useful, especially with regard to identifying outliers. Outliers are values that exceed the "normal" or expected range of values. For example, suppose ages are recorded on each of 20 individuals participating in an experimental study. Suppose the mean age for the sample is 83.5, with a standard deviation of 5.6. Suppose the minimum age is 70 and the highest five ages, in descending order, are 110, 90, 89, 89, and 87. Assuming that each age was recorded accurately, an age of 110 might be considered an outlier. It is not an incorrect value, just a value outside (in this case, above) the normal range. Outlying values can be determined by an expert in the particular substantive area or by using one of several statistical definitions (see Example 2.3). Assuming there are no errors in the data, the statistical analyst need not do anything in particular with respect to outliers, only be aware of their existence and their impact on certain descriptive statistics (e.g., the sample mean).
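The mode, minimum, and maximum are simple to obtain by counting and sorting. The sketch below is a minimal Python illustration, not part of the original text; the helper `modes` is a hypothetical name, written so that a sample with no repeated values returns no mode, as the definition above requires.

```python
from collections import Counter

def modes(data):
    """Return the most frequent value(s); an empty list if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # no repeated values, so no mode
    return sorted(value for value, count in counts.items() if count == highest)

sbp = [121, 110, 114, 100, 160, 130, 130]  # Example 2.1

print("mode(s):", modes(sbp))
print("minimum:", min(sbp))
print("maximum:", max(sbp))
print("no repeated values:", modes([120, 121, 122, 124, 125, 126, 127]))
```

Returning a list allows for samples with several modes, such as a sample in which two different values each appear twice.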

Another descriptive statistic that addresses dispersion in a data set is the range. The range is defined as the maximum value minus the minimum value. In Example 2.1 the range = 160 - 100, or 60. Some investigators report the range as "100 to 160"; others report the range as 60. Both reports are appropriate. As noted, the range addresses dispersion in the sample. In Example 2.1, the observed systolic blood pressures cover 60 units. The range is based on only two values in the sample, the maximum and minimum, and although it is a very useful statistic, it can be somewhat misleading, especially in the presence of outliers. For example, if the maximum value was 260 instead of 160 and all other data elements were unchanged, the range would be 260 - 100 = 160. This would suggest much more dispersion in the sample than the range of 60 (based on the data presented in Example 2.1) when only a single observation changed. We suggest that the range be interpreted with caution and that the standard deviation be used to address dispersion in a sample. Consider the following samples, call them samples A, B, C, and D. The samples are all of the same size (n = 11), have the same means (50) and ranges (100), yet the standard deviations are different.

How are the samples different?

                              Sample
                        A      B      C      D
  Raw Data              0      0      0      0
                       50     10     20      0
                       50     20     20      0
                       50     30     20      0
                       50     40     20      0
                       50     50     50     50
                       50     60     80    100
                       50     70     80    100
                       50     80     80    100
                       50     90     80    100
                      100    100    100    100

  Summary Statistics
  n                    11     11     11     11
  X̄                    50     50     50     50
  Range               100    100    100    100
  s                    22     33     35     50

Based on the standard deviations, the first sample has the least variation among observations, and the last sample has the most. The range, in this example, does not differentiate the samples.
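The comparison of samples A through D can be reproduced in a few lines. The following minimal Python sketch is illustrative only; it assumes each sample has a minimum of 0, which is implied by the stated range of 100 and maximum of 100, and the dictionary name `samples` is an assumption of this sketch.

```python
import statistics

# Samples A-D: same size (11), same mean (50), same range (100).
# The minimum of 0 in each sample is inferred from the range and maximum.
samples = {
    "A": [0] + [50] * 9 + [100],
    "B": list(range(0, 101, 10)),
    "C": [0, 20, 20, 20, 20, 50, 80, 80, 80, 80, 100],
    "D": [0] * 5 + [50] + [100] * 5,
}

for name, data in samples.items():
    n = len(data)
    mean = statistics.mean(data)
    data_range = max(data) - min(data)
    s = statistics.stdev(data)  # sample standard deviation, divisor (n - 1)
    print(f"Sample {name}: n = {n}, mean = {mean}, range = {data_range}, s = {s:.1f}")
```

The size, mean, and range are identical across the four samples, while the standard deviation increases from sample A to sample D as the observations move away from the mean toward the extremes.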

SAS Example 2.1  Summary Statistics on Systolic Blood Pressures Using SAS

The following descriptive statistics were generated using SAS Proc Univariate (see Section 2.4 for more details) and the data in Example 2.1. An interpretation of the relevant components appears after the output.

SAS Output for Example 2.1

    Summary Statistics

    The UNIVARIATE Procedure
    Variable: sbp (systolic blood pressure)

    Moments
    N                           7    Sum Weights                 7
    Mean               123.571429    Sum Observations          865
    Std Deviation      19.3550781    Variance           374.619048
    Skewness           1.04211435    Kurtosis           1.63467176
    Uncorrected SS         109137    Corrected SS       2247.71429
    Coeff Variation     15.663069    Std Error Mean     7.31553189

    Basic Statistical Measures
    Location                    Variability
    Mean     123.5714           Std Deviation          19.35508
    Median   121.0000           Variance              374.61905
    Mode     130.0000           Range                  60.00000
                                Interquartile Range    20.00000

    Tests for Location: Mu0=0
    Test           -Statistic-       -----p Value------
    Student's t    t   16.89165      Pr > |t|     <.0001
    Sign           M        3.5      Pr >= |M|    0.0156
    Signed Rank    S         14      Pr >= |S|    0.0156

    Quantiles (Definition 5)
    Quantile       Estimate
    100% Max            160
    99%                 160
    95%                 160
    90%                 160
    75% Q3              130
    50% Median          121
    25% Q1              110
    10%                 100
    5%                  100
    1%                  100
    0% Min              100

    Extreme Observations
    ----Lowest----      ----Highest---
    Value      Obs      Value      Obs
      100        4        114        3
      110        2        121        1
      114        3        130        6
      121        1        130        7
      130        7        160        5

Interpretation of SAS Output for SAS Example 2.1

The SAS Univariate Procedure is used to generate descriptive statistics on a continuous variable. SAS generates a number of descriptive statistics in the section labeled "Moments"; we will highlight only a few. The sample size is 7 (notice that SAS uses uppercase N as opposed to lowercase n to denote sample size), the sample mean is 123.6, and the sample standard deviation is 19.4. The sum of the observations (i.e., $\sum X$) is 865, and the sample variance, $s^2$, is 374.6. The skewness of a sample indicates the degree of asymmetry in the sample distribution. Values close to 0 are indicative of symmetry. The kurtosis of a sample indicates the thickness in the tails of the distribution (i.e., the degree of clustering of observations at the extremes). Again, values close to 0 indicate lack of clustering in the tails. Estimates of skewness and kurtosis are somewhat unreliable in small samples and should be interpreted with caution. The normal distribution (discussed extensively in Chapter 3) has skewness = 0 and kurtosis = 0. The uncorrected sum of squares, "Uncorrected SS," is the sum of the squared observations (i.e., $\sum X^2 = 109{,}137$, which we used in the computational formula for the sample variance). The corrected sum of squares, "Corrected SS," is the numerator of the sample variance, $\sum (X - \bar{X})^2 = 2247.7$. The coefficient of variation, "Coeff Variation," is defined as the ratio of the sample standard deviation to the sample mean, expressed as a percentage (i.e., $(s/\bar{X}) \times 100$). The standard error of the mean, "Std Error Mean," is defined as $s/\sqrt{n}$.
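The entries in the "Moments" section can be reproduced from the seven observations. The following is a minimal Python sketch, not SAS code. The coefficient of variation and standard error follow the definitions given above; for skewness and kurtosis the sketch assumes the bias-adjusted sample formulas commonly documented for SAS under its default (n - 1) divisor. That choice is an assumption of this sketch, although it does reproduce the printed values for this sample.

```python
import math

sbp = [121, 110, 114, 100, 160, 130, 130]  # Example 2.1

n = len(sbp)
mean = sum(sbp) / n
uncorrected_ss = sum(x * x for x in sbp)         # sum of squared observations
corrected_ss = sum((x - mean) ** 2 for x in sbp) # numerator of the variance
variance = corrected_ss / (n - 1)
s = math.sqrt(variance)

coeff_variation = s / mean * 100                 # (s / mean) * 100
std_error_mean = s / math.sqrt(n)                # s / sqrt(n)

# Assumed bias-adjusted formulas for skewness and (excess) kurtosis.
z = [(x - mean) / s for x in sbp]
skewness = n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)
kurtosis = (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(v ** 4 for v in z)
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

print(f"Uncorrected SS  = {uncorrected_ss}")
print(f"Corrected SS    = {corrected_ss:.5f}")
print(f"Coeff Variation = {coeff_variation:.6f}")
print(f"Std Error Mean  = {std_error_mean:.8f}")
print(f"Skewness        = {skewness:.5f}")
print(f"Kurtosis        = {kurtosis:.5f}")
```

With only seven observations, the skewness and kurtosis estimates are dominated by the single value of 160, which illustrates why the text advises caution in small samples.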

The next part of the SAS output summarizes the most popular summary statistics for continuous variables in the section entitled "Basic Statistical Measures." Several measures of location are provided (the mean, median, and mode) as are several measures of dispersion (standard deviation, range, and interquartile range). The range is 160 - 100, or 60; the interquartile range, the difference between the first and third quartiles ($Q_3 - Q_1$), is 20. The most appropriate measures of location and dispersion depend on whether there are outliers in the data set. If there are no outliers, the mean and standard deviation are the most appropriate measures of location and dispersion, respectively. If there are outliers, the median and interquartile deviation (defined as $(Q_3 - Q_1)/2$) are the most appropriate measures of location and dispersion, respectively.

The next part of the SAS output contains "Tests for Location"; these will be discussed in detail in Chapter 5.

After "Tests for Location," SAS output displays the "Quantiles" (or percentiles) of the variable, where the kth quantile is defined as the score that holds k% of the data below it. For example, the maximum value is equivalent to the 100th quantile and equal to 160 in SAS Example 2.1, the 75th quantile is equivalent to the third quartile (130), and so on. SAS also presents the 99th quantile, which is equal to 160 in SAS Example 2.1. Since this is a small data set, these fine classifications are unnecessary and not meaningful. SAS then prints the "Extreme Observations" in the data set. In particular, the five smallest and five largest values are printed. Next to each value, in parentheses, is the observation number (i.e., the position of the observation in the data set); for example, the smallest value is 100, the fourth observation among the seven.

Example 2.2 Summary Statistics on Total Cholesterol Levels

Eight subjects are randomly selected from a population of patients with hypertension. Total serum cholesterol, in mg/100 ml, is measured on each subject and the sample data are

212  197  184  211  233  245  219  260

In the following, we summarize the cholesterol data using the statistics introduced in Example 2.1. Since total serum cholesterol is a continuous variable (as was systolic blood pressure), the same statistics will be computed. We will limit our discussion here to items and concepts that were addressed in detail in Example 2.1. Again, a small sample size is used to illustrate the calculation of descriptive statistics to keep computation time to a minimum. In actual practice, the same techniques can be applied to larger samples.

We will let the variable X denote total serum cholesterol. The table displays the data elements, which have been ordered from smallest to largest, along with the value of each data element squared (which is used in the computation of the sample variance):

          X        X^2
        184     33,856
        197     38,809
        211     44,521
        212     44,944
        219     47,961
        233     54,289
        245     60,025
        260     67,600
Total  1761    392,005

The number of patients, or sample size, is n = 8. The mean cholesterol level in this sample is

$$\bar{X} = \frac{\sum X}{n} = \frac{1{,}761}{8} = 220.1$$

Notice that some of the total cholesterol levels are above the mean and others are below. This will always be true, as displayed in Figure 2.2 using Example 2.1. The sample mean represents a typical cholesterol level in this sample. To address dispersion in the sample, we use the computational formula for the sample variance (2.4):

$$s^2 = \frac{\sum X^2 - (\sum X)^2/n}{n-1} = \frac{392{,}005 - (1{,}761)^2/8}{7} = \frac{392{,}005 - (3{,}101{,}121)/8}{7} = \frac{392{,}005 - 387{,}640.125}{7} = \frac{4364.875}{7} = 623.6$$

Generally, the variance is not used to summarize dispersion. Instead, the sample standard deviation is computed:

$$s = \sqrt{623.6} = 25.0$$
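The arithmetic of the computational formula can be checked with a short script. The following is a minimal Python sketch (the book itself uses SAS; the variable names here are illustrative assumptions) that reproduces the sums, the variance, and the standard deviation for the eight cholesterol values.

```python
import math

# Cholesterol levels (mg/100 ml) from Example 2.2, ordered smallest to largest
chol = [184, 197, 211, 212, 219, 233, 245, 260]

n = len(chol)
sum_x = sum(chol)                    # sum of the observations
sum_x2 = sum(x * x for x in chol)    # sum of the squared observations

mean = sum_x / n
# Computational formula: (sum of squares - (sum)^2 / n) / (n - 1)
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(f"n = {n}, mean = {mean:.1f}, variance = {variance:.1f}, sd = {sd:.1f}")
```

Rounded to one decimal place, the script gives the same mean, variance, and standard deviation as the hand computation.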

The sample standard deviation represents how far each total cholesterol level is from the mean of 220.1. Again, by itself the standard deviation is often difficult to interpret. In particular, it is generally difficult to quantify what value of a standard deviation is considered large and what value is considered small. Individuals with substantive knowledge of the characteristic under investigation might have a feel for what is large and small with respect to the standard deviation.

Our standard summary of the cholesterol levels (a continuous variable) is n = 8, $\bar{X} = 220.1$, and s = 25.0.

The maximum cholesterol level is 260 and the minimum is 184. The range is 260 - 184, or 76. There is a substantial difference between the smallest and largest cholesterol levels, a difference of 76 units. Recall that the range is another measure of dispersion. In general, the range is less useful as a measure of dispersion than the sample standard deviation.

The median is the middle value, or the value that holds 50% of the cholesterol levels above it and 50% of the cholesterol levels below it. It can be computed by arranging the data from smallest to largest, and successively counting from the left and the right to arrive at the median, or middle, value:

Step 1: 184 and 260 are counted off, leaving 197 211 212 219 233 245
Step 2: 197 and 245 are counted off, leaving 211 212 219 233
Step 3: 211 and 233 are counted off, leaving 212 219 (the two middle values)

Because the number of observations in this sample is even (n = 8), there are two middle values. When the sample size is even, the median is defined as the mean of the two middle values:

$$\text{Median} = \frac{212 + 219}{2} = 215.5$$

In Example 2.2, 50% of the cholesterol levels are above 215.5 and 50% of the cholesterol levels are below 215.5.

We now have two statistics that represent a typical cholesterol level, the sample mean and the sample median. Although they convey different information, in general only one is necessary. Which is the best statistic to address location in this sample? Reviewing the cholesterol levels in the sample, there do not appear to be outliers at either extreme, thus the mean is a better measure of location. An individual with clinical expertise would, however, be in a better position to make that assessment. We will present guidelines for assessing outliers based on statistical formulations.

In this example, as in Example 2.1, the sample size is small and it is easy to order the data elements from smallest to largest and to count into the middle from right and left to determine the median. In applications where the sample size is larger, it may be more efficient to use the two-step method described in Example 2.1: we first compute the position(s) of the middle value(s) in the ordered data set and then locate those value(s). When the number of observations is even, there are two middle values, and their positions are computed as follows:

$$\frac{n}{2} \quad \text{and} \quad \frac{n}{2} + 1$$

For Example 2.2, $n/2 = 8/2 = 4$ (the 4th position) and $(n/2) + 1 = 4 + 1 = 5$ (the 5th position). The median is the mean of the observations in the 4th and 5th positions in the ordered data set (i.e., [212 + 219]/2 = 215.5).

To further describe the sample, we now compute the quartiles. As noted in Example 2.1, the best way to determine the quartiles is to first compute the positions of the quartiles in the ordered data set and then to locate the values. When the number of observations in the sample is even, the positions of the quartiles are determined by the following formula:

$$\left[\frac{n+2}{4}\right] \qquad (2.8)$$

For Example 2.2,

$$\left[\frac{n+2}{4}\right] = \left[\frac{8+2}{4}\right] = \left[\frac{10}{4}\right] = [2.5] = 2$$

The quartiles are in the second positions from the top and bottom of the ordered data set. The first quartile is $Q_1 = 197$ and the third quartile is $Q_3 = 245$.
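The position rules for an even sample size can be sketched in a few lines of Python. This is an illustrative sketch, not the book's SAS: the function names are assumptions, and the brackets in formula (2.8) are taken to mean the integer part, which is what the worked example ([2.5] = 2) shows.

```python
def median_even(sorted_values):
    """Median of an even-sized, sorted sample: mean of positions n/2 and n/2 + 1."""
    n = len(sorted_values)
    if n % 2 != 0:
        raise ValueError("this sketch covers even sample sizes only")
    lower_pos, upper_pos = n // 2, n // 2 + 1          # 1-based positions
    return (sorted_values[lower_pos - 1] + sorted_values[upper_pos - 1]) / 2


def quartiles_even(sorted_values):
    """Quartiles at position [(n + 2) / 4] counted from the bottom and from the top."""
    n = len(sorted_values)
    if n % 2 != 0:
        raise ValueError("this sketch covers even sample sizes only")
    pos = (n + 2) // 4                                  # integer part of (n + 2) / 4
    q1 = sorted_values[pos - 1]                         # pos-th value from the bottom
    q3 = sorted_values[n - pos]                         # pos-th value from the top
    return q1, q3


chol = sorted([212, 197, 184, 211, 233, 245, 219, 260])
print(median_even(chol))       # median of the cholesterol levels
print(quartiles_even(chol))    # (Q1, Q3)
```

For the cholesterol data the sketch returns a median of 215.5 and quartiles of 197 and 245, matching the values located by hand.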

SAS Example 2.2 Summary Statistics on Total Cholesterol Levels Using SAS

The following descriptive statistics were generated using SAS. For Example 2.2, we produced the abbreviated summary statistics as opposed to the more extensive summary statistics illustrated in Example 2.1. The following descriptive statistics were produced by SAS Proc Means (see the following interpretation and Section 2.4 for more details) using the data in Example 2.2. A brief interpretation appears after the output.

SAS Output for Example 2.2

Summary Statistics
The MEANS Procedure
Analysis Variable: chol total serum cholesterol

N           Mean       Std Dev       Minimum       Maximum
8    220.1250000    24.9710547   184.0000000   260.0000000

Interpretation of SAS Output for Example 2.2

The SAS Means Procedure is used to generate descriptive statistics on a continuous variable. Many different statistics can be requested; the statistics shown are the default statistics. The sample size is 8 (notice that SAS uses uppercase "N" as opposed to lowercase "n" to denote sample size), the sample mean is 220.1, and the sample standard deviation is 25.0. The minimum and maximum cholesterol levels are 184 and 260. SAS displays summary statistics to eight decimal places by default. Users should round appropriately to report summary statistics.

Example 2.3 Summary Statistics on Ages

A sample of 51 individuals is selected for participation in a study of cardiovascular risk factors. The following data represent the ages of enrolled individuals measured in years (continuous variable). Here, age is measured in the usual way with a person being recorded as 65, for example, until the day he/she turns 66. The data are as follows:

60 62 63 64 64 65 65 65 65 65
65 66 66 66 66 66 67 67 67 68
68 68 70 70 70 71 71 72 72 73
73 73 73 73 75 75 75 75 76 76
77 77 77 77 77 79 82 83 85 85
87

The number of subjects, or sample size, is n = 51, a much larger sample size than in previous examples. Here it is not possible to interpret the age data simply by inspecting the values. Instead, we need summaries of location and dispersion.

The mean age in this sample is

$$\bar{X} = \frac{\sum X}{n} = \frac{3{,}637}{51} = 71.3$$

In order to assess dispersion in the sample, we will compute the sample standard deviation. As a first step, we compute the sample variance using the computational formula presented in Example 2.1:

$$s^2 = \frac{261{,}439 - (3{,}637)^2/51}{51 - 1} = 41.4$$

The sample standard deviation is

$$s = \sqrt{41.4} = 6.4$$

In general, participants' ages deviate from the mean of 71.3 by 6.4 years. Notice that the magnitude of the standard deviation of the ages is smaller than the standard deviations we computed on the systolic blood pressures in Example 2.1 ($s_{SBP} = 19.4$) and on the cholesterol levels in Example 2.2 ($s_{CHOL} = 25.0$). Standard deviations are interpreted relative to their scale of measurement. In Example 2.3, the participants are very homogeneous with respect to age. It is possible that the study objectives were focused on individuals 60 years of age or older. Most, if not all, studies have very explicit inclusion and exclusion criteria, which must be recognized in order to appropriately interpret summary statistics.
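With 51 observations the same computations are tedious by hand. The following minimal Python sketch (an illustration, not the book's SAS) rebuilds the ages in Example 2.3 from a tally of each distinct age and recomputes the summaries; the tally `age_counts` is an assumption in the sense that it was reconstructed to match the reported sum (3,637) and sum of squares (261,439).

```python
import math
import statistics

# Tally of ages in Example 2.3: age -> number of participants (51 in total)
age_counts = {
    60: 1, 62: 1, 63: 1, 64: 2, 65: 6, 66: 5, 67: 3, 68: 3, 70: 3, 71: 2,
    72: 2, 73: 5, 75: 4, 76: 2, 77: 5, 79: 1, 82: 1, 83: 1, 85: 2, 87: 1,
}
ages = [age for age, count in age_counts.items() for _ in range(count)]

n = len(ages)
sum_x = sum(ages)
sum_x2 = sum(a * a for a in ages)

mean_age = sum_x / n
var_age = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # computational formula
sd_age = math.sqrt(var_age)
median_age = statistics.median(ages)
mode_age = statistics.mode(ages)

print(n, sum_x, sum_x2)
print(round(mean_age, 1), round(var_age, 1), round(sd_age, 1), median_age, mode_age)
```

The rounded mean, variance, and standard deviation agree with the values computed in the text.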

As noted earlier, in many publications and research reports, investigators do not present raw data (i.e., observations measured on each member of a sample); instead, they present summary statistics. Suppose that we did not have access to the actual ages of each participant here, and that instead we had only the summary statistics: n = 51, X = 71.3, and s = 6.4. The mean and standard deviation are used to understand where the data are located and how they are spread. The Empirical Rule can be used to learn more about a particular characteristic based on these commonly available statistics (discussed further in Chapter 4):

Empirical Rule

Approximately 68% of the observations fall between $\bar{X} - s$ and $\bar{X} + s$.
Approximately 95% of the observations fall between $\bar{X} - 2s$ and $\bar{X} + 2s$.
Approximately all of the observations fall between $\bar{X} - 3s$ and $\bar{X} + 3s$.

Using the data in Example 2.3, the Empirical Rule indicates that approximately 68% of the ages fall between 71.3 - 6.4 = 64.9 and 71.3 + 6.4 = 77.7, approximately 95% of the ages fall between 58.5 and 84.1, and almost all of the ages fall between 52.1 and 90.5. Because we have the actual observations here, we computed the percentages of 51 observations that actually fell into each range. The following table illustrates how closely the Empirical Rule approximates the distribution of ages in this sample:

               Empirical Rule
Range          Percent of Observations    Percent of Sample Data
64.9-77.7      Approximately 68%          78.4%
58.5-84.1      Approximately 95%          94.1%
52.1-90.5      Almost all                 100%
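The "Percent of Sample Data" column can be reproduced by counting how many ages fall within one, two, and three standard deviations of the mean. The sketch below is a Python illustration under the same assumption as before: the tally `age_counts` was reconstructed to match the reported sum and sum of squares, and the helper name `percent_within` is hypothetical.

```python
import math

# Tally of ages in Example 2.3: age -> number of participants (51 in total)
age_counts = {
    60: 1, 62: 1, 63: 1, 64: 2, 65: 6, 66: 5, 67: 3, 68: 3, 70: 3, 71: 2,
    72: 2, 73: 5, 75: 4, 76: 2, 77: 5, 79: 1, 82: 1, 83: 1, 85: 2, 87: 1,
}
ages = [age for age, count in age_counts.items() for _ in range(count)]

n = len(ages)
mean_age = sum(ages) / n
sd_age = math.sqrt(sum((a - mean_age) ** 2 for a in ages) / (n - 1))


def percent_within(values, center, spread, k):
    """Percent of values lying between center - k*spread and center + k*spread."""
    lower, upper = center - k * spread, center + k * spread
    inside = sum(lower <= v <= upper for v in values)
    return 100 * inside / len(values)


coverage = [round(percent_within(ages, mean_age, sd_age, k), 1) for k in (1, 2, 3)]
print(coverage)
```

The three percentages correspond to the rows of the table: 40 of 51 ages within one standard deviation, 48 of 51 within two, and all 51 within three.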

The Empirical Rule suggested that approximately 68% of the ages would fall between 64.9 and 77.7. In Example 2.3, 78.4% of the ages actually fell between 64.9 and 77.7. Similarly, the Empirical Rule suggested that approximately 95% of the ages would fall between 58.5 and 84.1. In Example 2.3, 94.1% of the ages actually fell between 58.5 and 84.1. Finally, the Empirical Rule suggested that almost all of the ages would fall between 52.1 and 90.5, and in fact they do.

The computation of the mean and standard deviation is cumbersome with a sample of 51 observations. We now use the SAS Proc Univariate procedure to generate descriptive statistics for the data in Example 2.3.

SAS Example 2.3 Summary Statistics on Ages Using SAS

The following descriptive statistics were generated using SAS Proc Univariate and the data in Example 2.3. An interpretation of the relevant components appears after the output.

SAS Output for Example 2.3

Summary Statistics
The UNIVARIATE Procedure
Variable: age (age in years)

                          Moments
N                         51    Sum Weights               51
Mean              71.3137255    Sum Observations        3637
Std Deviation      6.4358067    Variance          41.4196078
Skewness          0.57609039    Kurtosis          -0.2678579
Uncorrected SS        261439    Corrected SS      2070.98039
Coeff Variation   9.02463958    Std Error Mean    0.90119319

                Basic Statistical Measures
    Location                    Variability
Mean      71.31373    Std Deviation            6.43581
Median    71.00000    Variance                41.41961
Mode      65.00000    Range                   27.00000
                      Interquartile Range     10.00000

              Tests for Location: Mu0=0
Test            Statistic          p Value
Student's t     t   79.13256       Pr > |t|   ...
Sign            ...
Signed Rank     ...

... patients in whom the antibiotic is effective), P(X > 8). The binomial formula and Table B.1 produce the probability of observing exactly x successes out of n. In this example we are interested in the probability of observing more than 8 successes. Before using the binomial formula (or Table B.1), we must restate our problem in a format compatible with the binomial formula (and Table B.1):

$$P(X > 8) = P(X = 9 \text{ or } X = 10) = P(X = 9) + P(X = 10)$$

We can now compute P(X = 9) and P(X = 10) using the binomial formula (3.6), applied twice.

$$P(X = 9) = \frac{10!}{9!\,(10-9)!}\, 0.50^{9} (1 - 0.50)^{10-9} = 10(0.0020)(0.5) = 0.0098$$

$$P(X = 10) = \frac{10!}{10!\,(10-10)!}\, 0.50^{10} (1 - 0.50)^{10-10} = 1(0.0010)(1) = 0.0010$$

Thus, P(X > 8) = 0.0098 + 0.0010 = 0.0108 (see also Table B.1).

In this application, there is a 1.1% chance that the antibiotic will be effective in more than 8 patients when given to 10. The possible outcomes of this experiment are the values in the sample space, S = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. Because the antibiotic is 50% effective for any given patient, we are more likely to observe 4, 5, or 6 successes out of 10 than we are to observe few (0, 1, or 2) or many (8, 9, 10) successes (see Table B.1).
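The two applications of the binomial formula can be checked numerically. The following is a minimal Python sketch (not the book's SAS); `binomial_pmf` is an illustrative helper name, and `math.comb` supplies the number of ways to choose x successes out of n.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial random variable with n trials and success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


p9 = binomial_pmf(9, 10, 0.50)
p10 = binomial_pmf(10, 10, 0.50)
p_more_than_8 = p9 + p10

print(round(p9, 4), round(p10, 4), round(p_more_than_8, 4))
```

Computed without intermediate rounding, the sum is 11/1024, or about 0.0107; the 0.0108 in the text arises from adding the two terms after each has been rounded to four decimal places. Either way the probability is about 1.1%.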

Recall Example 3.4 from Section 3.4 in which we had a population consisting of four laboratory mice, two diseased and two nondiseased. In the example we selected two subjects at random. Each selection resulted in one of two possible outcomes: selection of a diseased subject or selection of a nondiseased subject. The probability that we selected a diseased subject was 0.50 under the sampling with replacement strategy. Note that sampling with replacement ensures constant probability of success (in this example, probability of selecting a diseased subject is 0.5 on each selection as long as we sample with replacement). Suppose we call the selection of a diseased subject a success. (In general, we call the outcome of interest a success; often, particularly in medical applications, the outcome of interest is not the healthy outcome, e.g., disease, mortality.) The random variable X under the sampling with replacement strategy illustrated in Example 3.4 is an example of a binomial random variable, whose probability distribution is given below. Recall we developed this probability distribution by enumerating every possible sample, assigning the value of X (the number of diseased subjects selected into each sample) and summarizing:

X    P(X)
0    4/16 = 0.25
1    8/16 = 0.50
2    4/16 = 0.25

This probability distribution is an example of a binomial probability distribution with n = 2 and p = 0.5. The binomial distribution formula and Table B.1 give the same probabilities; for example, with n = 2 and p = 0.5, the probability of selecting no diseased subjects (X = 0) is

$$P(X = 0) = \frac{2!}{0!\,(2-0)!}\, 0.50^{0} (1 - 0.50)^{2-0} = 1(1)(0.25) = 0.25$$

In the binomial distribution, it can be shown that the mean and variance of the random variable X (i.e., the number of successes out of n binomial trials) are computed as follows:

$$\mu = np, \qquad \sigma^2 = np(1 - p) \qquad (3.7)$$

In Example 3.5, we involved five patients in the experiment, and the probability of success was 70% for each patient. The possible outcomes of the experiment are given in the sample space S = {0, 1, 2, 3, 4, 5}. Each time the experiment is run, exactly one of these outcomes (number of successes) is observed.

The mean (or expected) number of patients (out of five) in whom the antibiotic is effective is $\mu = np = 5(0.7) = 3.5$. We could never observe 3.5 successes in any performance of the experiment; instead, we observe one of the values in the sample space (e.g., exactly 3 or 4 or 5 successes out of 5). The variance in the number of successes is $\sigma^2 = 5(0.7)(1 - 0.7) = 1.05$. The standard deviation is $\sigma = 1.02$. The mean number of successes represents the typical number of successes. In reviewing the probability distribution (shown in Example 3.5), we see that the most likely outcomes (those with the highest probabilities) of the experiment are 3 and 4 successes.

In Example 3.6, where the probability of success was 50% for each patient, the expected number of patients (out of 10) in whom the antibiotic is effective is $\mu = np = 10(0.5) = 5$. Again, any of the outcomes in the sample space can be observed on any performance of the experiment. We are most likely (or expect) to observe 5 successes out of 10 when p = 0.50.
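Formula (3.7) can be checked against the definition of a mean and variance, that is, by weighting each outcome in the sample space by its binomial probability. The sketch below is a Python illustration (not the book's SAS); the function names are assumptions.

```python
from math import comb, sqrt


def binomial_pmf(x, n, p):
    """P(X = x) for n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def binomial_summary(n, p):
    """Mean, variance and standard deviation from formula (3.7)."""
    mean = n * p
    variance = n * p * (1 - p)
    return mean, variance, sqrt(variance)


def summary_from_pmf(n, p):
    """Same quantities computed directly from the probability distribution."""
    mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
    variance = sum((x - mean) ** 2 * binomial_pmf(x, n, p) for x in range(n + 1))
    return mean, variance, sqrt(variance)


mean_35, var_35, sd_35 = binomial_summary(5, 0.7)    # Example 3.5
mean_36, var_36, sd_36 = binomial_summary(10, 0.5)   # Example 3.6
print(round(mean_35, 2), round(var_35, 2), round(sd_35, 2))
print(round(mean_36, 2), round(var_36, 2))
```

Both routes give the same values, which is one way to see that the shortcut formulas np and np(1 - p) summarize the whole probability distribution.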

3.5 The Normal Distribution

The normal distribution is our second probability model and is the most widely used probability distribution for continuous random variables. A characteristic (continuous variable) is said to follow a normal distribution if the distribution of values is bell- or mound-shaped (Figure 3.4).

[Figure 3.4 The Normal Distribution; horizontal axis: X = normal random variable]

The horizontal axis displays the values of the continuous normal random variable X. The vertical axis is scaled to accommodate the height of the curve at each value of the random variable that reflects the probability (or relative frequency) of observing that value. The total area under the normal curve is 1.0, as it is a probability distribution. The normal distribution is one where values in the center of the distribution are more likely (have higher probabilities) than values at the extremes.

The mathematical formula for the normal probability distribution is

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/(2\sigma^2)}$$

where x is a continuous random variable ($-\infty < x < \infty$), $\mu$ = mean of the random variable X, and $\sigma$ = standard deviation of the random variable X.

1. The normal distribution is symmetric about the mean (i.e., $P(X > \mu) = P(X < \mu) = 0.5$; see Figure 3.5a). A characteristic that follows a normal distribution is one in which there are as many values above the mean as below.

2. The mean = the median = the mode. This attribute follows directly from 1. If the distribution is symmetric at the mean, then half (50%) of the values are above the mean and half (50%) are below the mean. This is the definition of the median. Notice in Figure 3.5a that the peak of the distribution is exactly in the center of the distribution (at the mean = median). The height of the curve indicates the probability (or relative frequency) of observations at each point. The peak indicates the most frequent value, which is, by definition, the mode.

3. The mean and variance, $\mu$ and $\sigma^2$, completely characterize the normal distribution. If we know that a particular characteristic follows a normal distribution and we know $\mu$ and $\sigma$, then we know everything about that distribution. The mean and variance are the only parameters

[Figure 3.5(a)-(c) Properties of the Normal Distribution]

... = 0.7 (see Example 3.5). The second example generates random variables from a normal distribution with mean 108 and standard deviation 14 (see Example 3.8). The third example illustrates the computation of percentiles of the normal distribution using the data presented in Example 3.8.

For each example, we present three components:

1. the SAS program code
2. the computer output
3. a description of the relevant components of the computer output along with their interpretation

SAS Example 3.5 Generating Random Variables from the Binomial Distribution

Effectiveness of Antibiotic (Example 3.5)
Ranbin Function (Generates Random Variables from Binomial Distribution)

Suppose an antibiotic has been shown to be 70% effective against a common bacteria. If the antibiotic is given to five individuals with the bacteria, what is the probability that it will be effective on exactly three?

In this application, we will use SAS to generate random variables from the binomial distribution with n = 5 (number of trials) and p = 0.70 (probability of success on any trial). We will generate many such random variables, and in doing so estimate the theoretical probability distribution. The SAS program code follows.

Program Code

options ps=64 ls=80;          /* Formats the output page to 64 lines in length, 80 columns in width. */
data one;                     /* Beginning of Data Step (Data set name one). */
do i=1 to 5000;               /* Beginning of a DO loop which will be repeated 5000 times. The index i counts the iterations. */
x=ranbin(21439,5,0.7);        /* Generates a random variable, called x, from the binomial distribution with n=5, p=0.7. The first argument to the Ranbin function is the seed. (1) */
output;                       /* Writes the generated variable, x, to data set one. */
end;                          /* End of DO loop. */
run;                          /* End of Data Step. */
proc means;                   /* Procedure call (Proc Means to generate summary statistics). */
var x;                        /* Specification of variable x. */
run;                          /* End of Procedure section. */
proc freq;                    /* Procedure call (Proc Freq to generate frequency distribution = probability distribution). */
tables x;                     /* Specification of variable x. */
run;                          /* End of Procedure section. */

(1) The seed is simply a random number used as a starting point for the random number generator (i.e., ranbin). When the seed is changed, different values are generated.

Computer Output

The MEANS Procedure
Analysis Variable: x

   N         Mean      Std Dev     Minimum      Maximum
5000    3.4942000    1.0282888           0    5.0000000

The FREQ Procedure

                              Cumulative    Cumulative
x    Frequency    Percent     Frequency       Percent
0           15       0.30            15          0.30
1          135       2.70           150          3.00
2          686      13.72           836         16.72
3         1529      30.58          2365         47.30
4         1798      35.96          4163         83.26
5          837      16.74          5000        100.00

Interpretation

The first part of the output is from the Means procedure. Shown are the number of observations in the sample (N = 5000), the mean, standard deviation, minimum, and maximum. In this application, the summary statistics are computed on the random variable X, which reflects the numbers of patients in whom the antibiotic is effective. Recall for the binomial distribution, the mean is given by $\mu = np$. In Example 3.5, n = 5 and p = 0.7. The theoretical mean is $\mu = 5(0.7) = 3.5$. In this data set in which we generated 5,000 random variables from a binomial distribution with n = 5 and p = 0.7, the observed mean is 3.494.

The next section of output displays a frequency distribution table for the random variable X. Notice that the values of X generated by SAS are between 0 and 5. The observed frequency distribution (based on 5000 observations) closely approximates the true binomial probability distribution (see Example 3.5). For example, from Table B.1 in Appendix B, P(X = 0) = 0.0024, P(X = 1) = 0.0284, P(X = 2) = 0.1323, P(X = 3) = 0.3087, P(X = 4) = 0.3601, and P(X = 5) = 0.1681. The observed distribution (shown in the SAS output) is 0.003, 0.027, 0.137, 0.306, 0.360, and 0.167.

Using the SAS output, we can address the original question: If the antibiotic is given to five individuals, what is the probability that it will be effective in exactly three? From the SAS output, P(X = 3) = 0.306.
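The same experiment can be sketched outside SAS. The following Python illustration is an assumption-laden analogue of the Ranbin program, not a reproduction of it: it builds each binomial value by counting successes in five independent trials, and although it reuses the seed value 21439, Python's generator differs from SAS's, so the individual frequencies will not match the SAS output line for line.

```python
import random
from collections import Counter
from math import comb


def binomial_pmf(x, n, p):
    """Theoretical P(X = x) for n trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def simulate_binomial(n, p, draws, seed):
    """Generate `draws` binomial values, each the number of successes in n trials."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(draws)]


N_TRIALS, P_SUCCESS, N_DRAWS = 5, 0.7, 5000
sample = simulate_binomial(N_TRIALS, P_SUCCESS, N_DRAWS, seed=21439)
counts = Counter(sample)
observed_mean = sum(sample) / N_DRAWS

print(f"observed mean = {observed_mean:.3f} (theoretical {N_TRIALS * P_SUCCESS})")
for x in range(N_TRIALS + 1):
    observed = counts[x] / N_DRAWS
    print(f"x={x}  observed={observed:.3f}  theoretical={binomial_pmf(x, N_TRIALS, P_SUCCESS):.4f}")
```

As in the SAS run, the observed relative frequencies should sit close to the theoretical probabilities, and changing the seed changes the generated values without changing that conclusion.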

SAS Example 3.8 Generating Random Variables from the Normal Distribution

Systolic Blood Pressures (Example 3.8)
Rannor Function (Generates Random Variables from the Standard Normal Distribution)

Systolic blood pressures are assumed to follow a normal distribution, with a mean of 108 and a standard deviation of 14.

In this application, we will use SAS to generate random variables from the standard normal distribution (Z) and then transform them into normal random variables with a mean of 108 and a standard deviation of 14.

Program Code

options ps=64 ls=80;          /* Formats the output page to 64 lines in length, 80 columns in width. */
data one;                     /* Beginning of Data Step (Data set name one). */
mu=108;                       /* Create a variable called mu, assign the constant 108. */
sigma=14;                     /* Create a variable called sigma, assign the constant 14. */
do i=1 to 10000;              /* Beginning of a DO loop which will be repeated 10,000 times. The index i counts the iterations. */
z=rannor(13755);              /* Generates a random variable, called z, from the standard normal distribution. The only argument to the Rannor function is the seed. (1) */
x=mu+(z*sigma);               /* Creates a new variable, x, which is a linear function of z (x = mu + z*sigma). */
output;                       /* Writes the generated variables, x and z, to data set one. */
end;                          /* End of DO loop. */
run;                          /* End of Data Step. */
proc chart;                   /* Procedure call (Proc Chart to generate relative frequency histogram). */
vbar z/type=pct;              /* Specification of variable z, type=pct displays relative frequencies. */
vbar x/type=pct;              /* Specification of variable x, type=pct displays relative frequencies. */
run;                          /* End of Procedure section. */
proc means;                   /* Procedure call (Proc Means to generate summary statistics). */
var z x;                      /* Specification of variables z and x. */
run;                          /* End of Procedure section. */

(1) The seed is simply a random number used as a starting point for the random number generator (i.e., rannor). When the seed is changed, different values are generated.

Computer Output

[Relative frequency histogram of z from SAS Proc Chart; vertical axis: Percentage, horizontal axis: Z Midpoint]

[Relative frequency histogram of x from SAS Proc Chart; vertical axis: Percentage, horizontal axis: X Midpoint]

The MEANS Procedure

Variable        N           Mean       Std Dev       Minimum        Maximum
z           10000    0.000430808     0.9914063    -4.4742924      3.6632627
x           10000    108.0060313    13.8796876    45.3599064    159.2856784

Interpretation

The first part of the output is a relative frequency histogram for the random variable z. The random variable z follows the standard normal distribution, with a mean of 0 and a standard deviation of 1. The range of the observed values of z is listed along the horizontal axis and is approximately -3.0 to 3.0. Although the figure produced by SAS Proc Chart is somewhat crude, it does resemble the normal distribution curve.

The second figure displays the relative frequency histogram for the random variable x. The random variable x follows a normal distribution, with a mean of 108 and a standard deviation of 14. The range of the observed values of x is listed along the horizontal axis and is approximately 66 to 160. Again, the figure produced by SAS Proc Chart is crude, but it does resemble the normal distribution curve. In fact, it is identical to the relative frequency distribution of z; only the values along the horizontal axis are different.

The next section of output is from the Means procedure. Shown are the number of observations in the sample (N = 10,000), the mean, standard deviation, minimum, and maximum. Summary statistics are computed on both random variables z and x. Notice that the observed mean of z is 0.00043; the theoretical mean of z is 0. The observed standard deviation of z is 0.99141; the theoretical standard deviation is 1. The observed mean of x is 108.006; the theoretical mean of x is 108. The observed standard deviation of x is 13.8797; the theoretical standard deviation is 14.

SAS Example 3.9 Computing Percentiles of the Normal Distribution

Systolic Blood Pressures (Example 3.9)
Probit Function (Returns Percentiles of the Standard Normal Distribution)

In Example 3.9 we computed the 5th and 90th percentiles of the systolic blood pressures illustrated in the previous example.

In this application, we will use SAS to determine the 5th and 90th percentiles of the systolic blood pressures, which are assumed to follow a normal distribution, with a mean of 108 and a standard deviation of 14.

Program Code

options ps=64 ls=80;          /* Formats the output page to 64 lines in length, 80 columns in width. */
data one;                     /* Beginning of Data Step (Data set name one). */
mu=108;                       /* Create a variable called mu, assign the constant 108. */
sigma=14;                     /* Create a variable called sigma, assign the constant 14. */
z05=probit(0.05);             /* Creates a variable called Z05, which is assigned the 5th percentile of the standard normal distribution. */
z90=probit(0.90);             /* Creates a variable called Z90, which is assigned the 90th percentile of the standard normal distribution. */
x05=mu+(z05*sigma);           /* Creates a new variable, X05, which is a linear function of Z05 (x = mu + z*sigma). X05 is the 5th percentile of a normal distribution with a mean of 108 and a standard deviation of 14. */
x90=mu+(z90*sigma);           /* Creates a new variable, X90, which is a linear function of Z90 (x = mu + z*sigma). X90 is the 90th percentile of a normal distribution with a mean of 108 and a standard deviation of 14. */
run;                          /* End of Data Step. */
proc print;                   /* Procedure call (Proc Print). */
run;                          /* End of Procedure section. */
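The percentile computation x = mu + z*sigma can also be sketched in Python, where `statistics.NormalDist().inv_cdf` plays the role that the Probit function plays in SAS. This is an illustrative analogue of the program above, not its output.

```python
from statistics import NormalDist

mu, sigma = 108, 14

# Percentiles of the standard normal distribution (the analogue of probit)
z05 = NormalDist().inv_cdf(0.05)
z90 = NormalDist().inv_cdf(0.90)

# Transform to the scale of systolic blood pressure: x = mu + z * sigma
x05 = mu + z05 * sigma
x90 = mu + z90 * sigma

print(f"z05 = {z05:.3f}, z90 = {z90:.3f}")
print(f"x05 = {x05:.1f}, x90 = {x90:.1f}")
```

The transformed values agree with asking for the percentiles directly from a normal distribution with mean 108 and standard deviation 14, e.g. `NormalDist(108, 14).inv_cdf(0.05)`, which is a useful cross-check on the linear transformation.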

Suppose we increase our sample size to n = 25. Then

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}} = \frac{10}{\sqrt{25}} = \frac{10}{5} = 2$$

Notice that under each sampling strategy the mean of the sample means is equal to the population mean. However, when the sample size is increased from 4 to 25 the standard error (or variation in the sample means) is reduced from 5 to 2. When the samples are larger in size, there is less variation in the sample means; the sample means are more tightly clustered about the population mean.
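The effect of sample size on the standard error is a one-line calculation. In the Python sketch below, the population standard deviation of 10 is an assumption inferred from the standard errors quoted in the text (5 for n = 4 and 2 for n = 25); the function name is illustrative.

```python
from math import sqrt


def standard_error(sigma, n):
    """Standard deviation of the sample mean for samples of size n."""
    return sigma / sqrt(n)


SIGMA = 10  # assumed population standard deviation, inferred from the text

for n in (4, 25, 100):
    print(f"n = {n:3d}  standard error = {standard_error(SIGMA, n):.1f}")
```

Because the sample size enters through its square root, quadrupling n only halves the standard error.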

4.2 The Central Limit Theorem

The following mathematical theorem is perhaps the most important theorem in statistics.

Central Limit Theorem

Suppose we have a population with mean $\mu$ and standard deviation $\sigma$. If we take simple random samples of size n with replacement from the population, for large n, the sampling distribution of the sample means is approximately normally distributed with

$$\mu_{\bar{X}} = \mu, \qquad \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

where, in general, n > 30 is sufficiently large.

The Central Limit Theorem (CLT) is important because in statistical inference we will make inferences about the population mean (\(\mu\)) based on the value of a single sample mean. The CLT states that for large samples (\(n \ge 30\)), the distribution of the sample means is approximately normal. If the population is normal, then the results hold for any size \(n\). If the population is binomial, then the following criteria are required: \(np \ge 5\) and \(n(1 - p) \ge 5\). In Chapter 3, we computed probabilities about normal random variables by standardizing (transforming to Z) and using Table B.2. The CLT tells us that we can use that same process to compute probabilities about the sample mean even if the population is not normal. This will be useful in statistical inference because when we make inferences about population parameters based on sample statistics, we will attach probability statements that quantify the precision in our inferences.

To reinforce the results of the Central Limit Theorem, we now present three illustrations. In each illustration we display the population distribution along with the sampling distributions of the sample means based on samples of size 5, 15, 30, and 50. We display the distributions graphically and present parameters associated with each in tabular form. The illustrations are presented in Figures 4.1-4.3. Figures 4.1 and 4.2 depict nonnormal populations; Figure 4.3 illustrates the normal population distribution. Figures 4.1a-d display the sampling distributions of the sample mean based on s.r.s. of size n = 5, n = 15, n = 30, and n = 50, respectively, from the population displayed in Figure 4.1. Similar graphs are shown for sampling distributions from the populations displayed in Figures 4.2 and 4.3. Notice for the normal population distribution (Figure 4.3), the sampling distributions of the sample mean are normal for each sample size considered. In the nonnormal cases (Figures 4.1 and 4.2), the sampling distributions of the sample means approach normality as the sample size increases (e.g., \(n \ge 30\)).

The mean of the uniform population shown in Figure 4.1 is 100, with a standard deviation of 57.7. When simple random samples are drawn, the means of the sampling distributions for samples of each size are equal to 100. The standard deviations of the sample means, or the standard errors, decrease as the sample size increases (see Table 4.2 and Figures 4.1a-d). Notice how the shapes of the sampling distributions of the sample means start to look approximately normal for sample sizes of 30 or larger.
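The behavior described for the uniform population can be explored by simulation. The sketch below is an illustration rather than the authors' method: it assumes the uniform population runs from 0 to 200 (a choice made here because it gives a mean of 100 and a standard deviation of about 57.7, matching the text), and the seed, the number of replications, and all names are arbitrary.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(1)                      # fixed seed so the run is repeatable
LOW, HIGH = 0.0, 200.0              # ASSUMED range of the uniform population
POP_SD = (HIGH - LOW) / sqrt(12)    # SD of a uniform distribution, about 57.7
REPS = 5000                         # simulated samples per sample size

def simulate_sample_means(n, reps=REPS):
    """Draw `reps` simple random samples of size n and return their means."""
    return [sum(random.uniform(LOW, HIGH) for _ in range(n)) / n
            for _ in range(reps)]

results = {}
for n in (5, 15, 30, 50):
    sample_means = simulate_sample_means(n)
    # (mean of the sample means, simulated standard error, sigma / sqrt(n))
    results[n] = (mean(sample_means), pstdev(sample_means), POP_SD / sqrt(n))
    print(n, [round(value, 1) for value in results[n]])
```

In each row the first number should sit close to 100 and the second close to the third, which is the theoretical standard error \(\sigma/\sqrt{n}\) (about 25.8, 14.9, 10.5, and 8.2). A histogram of `sample_means` for each n would show the shape moving toward the normal curve as n grows.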

Chapter 4 Sampling Distributions

Figure 4.1 Uniform Population
[Histogram of the uniform population]

Table 4.2 Uniform Population

Population                 Sample Size   Mean                      SD
Population distribution    -             \(\mu = 100\)             \(\sigma = 57.7\)
Sampling distribution      5             \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 25.9\)
Sampling distribution      15            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 14.9\)
Sampling distribution      30            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 10.5\)
Sampling distribution      50            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 8.2\)

Figure 4.1a Sampling Distribution of \(\bar{X}\), n = 5
Figure 4.1b Sampling Distribution of \(\bar{X}\), n = 15
Figure 4.1c Sampling Distribution of \(\bar{X}\), n = 30
Figure 4.1d Sampling Distribution of \(\bar{X}\), n = 50

Figure 4.2 Skewed Population
[Histogram of the skewed population]

Table 4.3 Skewed Population

Population                 Sample Size   Mean                      SD
Population distribution    -             \(\mu = 100\)             \(\sigma = 100\)
Sampling distribution      5             \(\mu_{\bar{X}} = 100\)   [illegible]
Sampling distribution      15            \(\mu_{\bar{X}} = 100\)   [illegible]
Sampling distribution      30            \(\mu_{\bar{X}} = 100\)   [illegible]
Sampling distribution      50            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 14.2\)

We now consider a second nonnormal population, one which is skewed to the right, with most observations clustered at the low end of the distribution (Figure 4.2).

The mean of the skewed population shown in Figure 4.2 is 100, with a standard deviation of 100. When simple random samples are drawn, the means of the sampling distributions for samples of each size are equal to 100. The standard deviations of the sample means, or the standard errors, decrease as the sample size increases (see Table 4.3 and Figures 4.2a-d). Again, notice how the shapes of the sampling distributions of the sample means start to look approximately normal for sample sizes of 30 or larger.

Figure 4.2a Sampling Distribution of \(\bar{X}\), n = 5
Figure 4.2b Sampling Distribution of \(\bar{X}\), n = 15
Figure 4.2c Sampling Distribution of \(\bar{X}\), n = 30
Figure 4.2d Sampling Distribution of \(\bar{X}\), n = 50

The last example involves a normal population (Figure 4.3). Notice that the sampling distributions of the sample means are approximately normally distributed for even small sample sizes (e.g., n = 5). In Figures 4.1 and 4.2, the sampling distributions of the sample mean looked normal only when the sample size was 30 or greater.

Figure 4.3 Normal Population
[Histogram of the normal population]

Table 4.4 Normal Population

Population                 Sample Size   Mean                      SD
Population distribution    -             \(\mu = 100\)             \(\sigma = 10\)
Sampling distribution      5             \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 4.5\)
Sampling distribution      15            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 2.6\)
Sampling distribution      30            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 1.8\)
Sampling distribution      50            \(\mu_{\bar{X}} = 100\)   \(\sigma_{\bar{X}} = 1.4\)

Figure 4.3a Sampling Distribution of \(\bar{X}\), n = 5
Figure 4.3b Sampling Distribution of \(\bar{X}\), n = 15
Figure 4.3c Sampling Distribution of \(\bar{X}\), n = 30
Figure 4.3d Sampling Distribution of \(\bar{X}\), n = 50

Example 4.3

Applying the Central Limit Theorem: Non-Normal Population

Telephone calls placed to a drug hotline during the hours of 9-5 weekdays have a mean length of 5 minutes with a standard deviation of 5 minutes. The distribution of all calls (i.e., the population) is of the form:

[Figure: population distribution of X = duration of call, in minutes]

Suppose the population is very large, and we take simple random samples of three calls (n = 3). The following are true by (4.1):

\[ \mu_{\bar{X}} = 5, \qquad \sigma_{\bar{X}} = \frac{5}{\sqrt{3}} = 2.89 \]

Similarly, if we take s.r.s. of size n = 100, then by (4.1)

\[ \mu_{\bar{X}} = 5, \qquad \sigma_{\bar{X}} = \frac{5}{\sqrt{100}} = 0.5 \]

By the Central Limit Theorem the distribution of the sample means for samples of size n = 100 is approximately normal as shown here:

[Figure: approximately normal sampling distribution of \(\bar{X}\) for n = 100]

The distribution of the lengths of phone calls (X) to the drug hotline for all callers (the population) was not normally distributed. The population distribution was skewed to the right, with the majority of calls lasting a shorter time and fewer calls lasting a longer time. When we look at collections of 100 calls at a time (samples of n = 100), the distribution of the mean length of calls (\(\bar{X}\)) is normally distributed (as shown).

Most of the topics addressed in subsequent chapters deal with statistical inference based on a single sample from the population of interest. We summarize the sample and then make inferences about unknown population parameters (e.g., \(\mu\)) based on sample statistics (e.g., \(\bar{X}\)). Without knowing the population distribution, as long as the sample is sufficiently large (usually size 30 is sufficient), we can appeal to the Central Limit Theorem to make probabilistic statements about the relationship between the sample mean and the unknown population mean. For example, the Central Limit Theorem states that the sample mean is approximately normally distributed with mean and standard deviation \(\mu_{\bar{X}}\) and \(\sigma_{\bar{X}}\), respectively. We can make statements about the distribution of sample means of size n by transforming to the standard normal distribution using (4.2) and Table B.2 (see Section 3.6).

\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \qquad (4.2) \]

The following example illustrates the use of the Central Limit Theorem and formula (4.2).

Example 4.4 Applying the Central Limit Theorem: Normal Population

Suppose we have a normal population with \(\mu = 100\) and \(\sigma = 10\). If we take a simple random sample of size 225, find the probability that the sample mean falls between 99 and 102. The problem of interest is stated as follows:

\[ P(99 < \bar{X} < 102) \]
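The probability posed in Example 4.4 can be evaluated numerically by standardizing the two endpoints with the standard error \(\sigma/\sqrt{n}\). The sketch below is an illustrative Python cross-check, not the book's worked solution; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_between(lo, hi, mu, sigma, n):
    """P(lo < X-bar < hi) when X-bar is (approximately) normal with
    mean mu and standard error sigma / sqrt(n)."""
    se = sigma / sqrt(n)              # standard error of the sample mean
    z_lo = (lo - mu) / se             # standardized lower endpoint
    z_hi = (hi - mu) / se             # standardized upper endpoint
    z = NormalDist()                  # standard normal distribution
    return z.cdf(z_hi) - z.cdf(z_lo)

p = prob_sample_mean_between(99, 102, mu=100, sigma=10, n=225)
print(round(p, 4))                    # about 0.9318
```

Here the standard error is 10/15, so the endpoints standardize to Z = -1.5 and Z = 3.0, the same values one would look up in a standard normal table.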


[...] 30), [...] distributed with

\[ Z = \frac{\bar{X} - \mu_{\bar{X}}}{\sigma_{\bar{X}}} = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \qquad (5.12) \]

where \(\mu_0\) is the mean specified in \(H_0\) (i.e., \(\mu_0 = 130\)). From Chapter 3, we know the properties of Z. If Z (5.12) is close to zero, which occurs when \(\bar{X}\) is close to \(\mu_0 = 130\), we suspect that \(H_0\) is most likely true. When Z is large, which occurs when \(\bar{X}\) is larger than \(\mu_0 = 130\), we suspect that \(H_1\) is most likely true. In hypothesis testing, we need to determine the point at which Z is "too large." That point is called the critical value of Z. Here we know that \(\sigma = 15\) and \(n = 108\). Therefore, under the null hypothesis (i.e., when \(\mu = 130\)), the distribution of the sample means is approximately normal with mean \(\mu_{\bar{X}} = \mu = 130\) and standard deviation \(\sigma_{\bar{X}} = \sigma/\sqrt{n} = 15/\sqrt{108} = 1.44\) (Figure 5.5).

Under \(H_0: \mu = 130\), it is possible to observe any value of \(\bar{X}\) displayed in Figure 5.5. However, we know that it is unlikely (i.e., the probability is very small) that \(\bar{X}\) will take on a value in the tails of the distribution. For example, it is unlikely to observe values of \(\bar{X}\) exceeding 132.88 (which is 2 standard deviations above the mean: \(\mu_{\bar{X}} + 2\sigma_{\bar{X}}\)). Recall from Chapter 4, \(P(\bar{X} > 132.88) = P(Z > 2) = 1 - 0.9772 = 0.0228\). Therefore, if we observe a sample mean that exceeds 132.88 and we reject the null hypothesis in favor of the alternative, the probability that we are making a mistake in rejecting \(H_0\) is only 2.28%. However, if we reject the null hypothesis for values of \(\bar{X}\) that exceed 131.44, the probability that we are making a mistake in rejecting \(H_0\) is \(P(\bar{X} > 131.44) = P(Z > 1) = 1 - 0.8413 = 0.1587\). We must decide what level of error (specifically, error of this type where we incorrectly reject the null hypothesis) we can tolerate in the analysis. A 15.87% probability of making this type of mistake may be too high.
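The two error probabilities quoted above can be reproduced from the sampling distribution of \(\bar{X}\) under the null hypothesis. The following Python sketch is illustrative only; the function name is hypothetical, and the results differ from the table-based values in the third or fourth decimal place because the text rounds the standard error to 1.44.

```python
from math import sqrt
from statistics import NormalDist

mu0, sigma, n = 130, 15, 108          # values given in the text
se = sigma / sqrt(n)                  # standard error, about 1.44
null_dist = NormalDist(mu0, se)       # distribution of X-bar when H0 is true

def prob_false_rejection(cutoff):
    """P(X-bar > cutoff) under H0: the chance of wrongly rejecting H0
    if we reject whenever the sample mean exceeds `cutoff`."""
    return 1 - null_dist.cdf(cutoff)

for cutoff in (132.88, 131.44):
    print(cutoff, round(prob_false_rejection(cutoff), 4))
```

Moving the cutoff further into the upper tail lowers the chance of a false rejection, which is the trade-off that motivates choosing a level of significance in advance.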

In hypothesis testing, investigators select a level of significance, denoted \(\alpha\), which is defined as the probability of rejecting \(H_0\) when \(H_0\) is true. The level of significance is generally in the range of 0.01 to 0.10, though any value from 0 to 1.0 can be selected. Because the level of significance reflects the likelihood of drawing an erroneous conclusion (i.e., \(\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})\)), small levels of significance are purposely selected.

Once a level of significance \(\alpha\) is selected, a decision rule is formulated. The decision rule is a formal statement of the criteria used to draw a conclusion in the hypothesis test. For example, suppose we select a level of significance of 5% (i.e., \(\alpha = 0.05\), we allow a 5% chance of rejecting \(H_0\) when \(H_0\) is true). The critical value and decision rule are displayed graphically in [...].

5.2 Hypothesis Tests About \(\mu\)

Figure 5.5 Distribution of \(\bar{X}\) under \(H_0: \mu = 130\) (\(\sigma_{\bar{X}} = 1.44\))
[Figure: rejection region for \(H_1: \mu > 130\), \(\alpha = 0.05\)]

The decision rule is given by

Reject \(H_0\) if \(Z \ge 1.645\)
Do not reject \(H_0\) if \(Z < 1.645\)

[...]

A final statement is made concerning our findings relative to the research or alternative hypothesis. Such a statement is as follows: We have significant evidence, \(\alpha = 0.05\), that the mean systolic blood pressure for males aged 50 in 2004 has increased from 130.

Table 5.3 summarizes the steps involved in the test of hypothesis procedure. The steps are displayed with reference to a test about the population mean \(\mu\). The same steps will be used to test hypotheses concerning other parameters. Table 5.4 contains critical values of Z for lower-, upper-, and two-tailed tests. The general forms of the hypotheses are presented along with the form of the decision rule for each type of test. Table 5.5 summarizes three different formulas for test statistics in tests concerning the population mean \(\mu\) and the conditions under which each should be applied.

Chapter 5 Statistical Inference: Procedures for \(\mu\)

Table 5.3 Tests of Hypothesis Concerning \(\mu\)

Step                                        Example
1. Set up hypotheses.                       \(H_0: \mu = \mu_0\)
                                            \(H_1: \mu > \mu_0\)
   Select level of significance.            \(\alpha = 0.05\)
2. Select appropriate test statistic.       \(Z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}\)
[...]

[...] specified in the alternative hypothesis and the location of the rejection region (see Figure 5.6). The other types of tests are called lower-tailed tests and two-tailed tests. In lower-tailed tests, the alternative hypothesis is of the form \(H_1: \mu < \mu_0\); and in two-tailed tests, the alternative hypothesis is of the form \(H_1: \mu \ne \mu_0\). In lower-tailed tests, the research hypothesis indicates a decrease in the mean; in two-tailed tests, the research hypothesis indicates a difference (either an increase or a decrease) [...]

Notice (n


5. Conclusion. In the final step, we draw a conclusion by comparing the test statistic (computed in step 4) to the decision rule (displayed in step 3). Do not reject \(H_0\) because \(-2.262 < -1.02 < 2.262\). We do not have significant evidence, \(\alpha = 0.05\), to show that the mean starting salary for females is significantly different from $29,500. Are the starting salaries the same?

Mean

SAS Example 5.10 Testing

Starting Salary Against Referent Using

SAS

The following output was generated using SAS Proc Means with an option conduct a

SAS /J.

A

to

brief interpretation appears after the output.

conduct a one-sample test of hypothesis in its Means procedure. one-sample test SAS assumes that the test of interest is versus Hi: jx 7^ 0. In this application (and in most others), we are

will

However, Ho:

test of hypothesis.

=

in the

mean of the analytic variable is zero. We want among females is 29.5 (in $1000s, which is equal to $29,500). In order to use SAS to test the desired hypotheses, we create a new variable, which we call TESTSTAT; it is simply our original analytic variable (i.e., starting salary) minus 29.5. Using the variable TESTSTAT, we can use SAS to test the desired hypotheses. In particular, if the mean of TESTSTAT is significantly different from zero, then we can conclude that the mean salary is significantly different from 29.5 (since TESTSTAT is simply equivalent to salary —29.5). Conversely, if the mean of TESTSTAT is not significantly different from zero, then we can conclude that the mean salary is not interested in testing

to test

if

the

mean

if

the

starting salary

not significantly different from 29.5.

SAS Output

for

Example 5.10

The MEANS Procedure N Mean

Variable Label

salary Annual Salary in $000s teststat

10 10

28.2000000 -1.3000000

Variable

Label

salary teststat

Annual Salary in $000s

Interpretation of

t

SAS Output

for

Std Dev

Std Error

4.0496913 4.0496913

1.2806248J

Value 22.02 -1.02

Pr >

1.2806248 It

I
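The t value reported for TESTSTAT can be checked by hand from the summary statistics in the SAS output. The Python sketch below is an illustrative cross-check, not part of the original text; it takes the critical value 2.262 from the text rather than computing it, since the standard library has no t distribution.

```python
from math import sqrt

n = 10                    # number of salaries (N in the SAS output)
mean_salary = 28.2        # sample mean, in $1000s
sd_salary = 4.0496913     # sample standard deviation
mu0 = 29.5                # referent mean under H0, in $1000s
t_crit = 2.262            # two-sided critical value quoted in the text (alpha = 0.05)

std_error = sd_salary / sqrt(n)              # matches "Std Error" in the output
t_stat = (mean_salary - mu0) / std_error     # same as the t Value for teststat
reject_h0 = abs(t_stat) >= t_crit

print(round(std_error, 4), round(t_stat, 2), reject_h0)
```

Subtracting 29.5 from every salary shifts the mean but leaves the standard deviation unchanged, which is why the two rows of the SAS output share the same Std Dev and Std Error.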

[...] \(p = 0.0194\). The \(p\) value is the probability of observing a value as extreme or more extreme than the observed test statistic; that is, \(P(t > 2.50 \text{ or } t < -2.50)\), shown graphically in the following figure.

[Figure: t distribution with an area of 0.0097 in each tail, beyond t = -2.50 and t = 2.50; p = 0.0194]

If this test were conducted by hand, critical values would be selected corresponding to the preselected level of significance \(\alpha\). The critical values for a two-sided test with \(\alpha = 0.05\) are shown next. Because the test statistic \(t = 2.50\) exceeds the critical value \(t = 2.064\) (see figure), we would reject \(H_0\) in favor of \(H_1\) and conclude that the mean is significantly different from 30.

[Figure: t distribution with critical values t = -2.064 and t = 2.064]

The same conclusion is reached by comparing the \(p\) value to the level of significance using rule (5.14). Because \(p = 0.0194 < \alpha = 0.05\), we reject \(H_0\). (Note: For the same test, if \(\alpha = 0.01\), we would not reject \(H_0\).)

Notice that when comparing the test statistic to a critical value, actual statistics are compared, whereas when comparing the \(p\) value to the level of significance, areas in the tails of the distribution are compared.

For the one-sample tests of means (\(H_0: \mu = \mu_0\) versus \(H_1: \mu > \mu_0\), \(H_1: \mu\) [...] 30.

COMPUTING p VALUES BY HAND

Example 5.8 Revisited: p-Value in Test for Mean Systolic Blood Pressure

In Example 5.8, we ran the following test at a 5% level of significance.

\(H_0: \mu = 130\)
\(H_1: \mu > 130\)

The decision rule we used was given by

Reject \(H_0\) if \(Z \ge 1.645\)
Do not reject \(H_0\) if \(Z < 1.645\)

We computed a test statistic of

\[ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} = \frac{135 - 130}{15/\sqrt{108}} = 3.46 \]

The \(p\) value is the smallest level of significance where we still reject \(H_0\). Using the table, we examine each level of significance smaller than 0.05 to determine if we could still reject \(H_0\) at that level. For example, at \(\alpha = 0.025\) we still reject \(H_0\) because 3.46 > 1.960. At \(\alpha = 0.01\) we also reject \(H_0\) because 3.46 > 2.326, at \(\alpha = 0.005\) we reject \(H_0\) because 3.46 > 2.576, at \(\alpha = 0.001\) we reject \(H_0\) because 3.46 > 3.090, but at \(\alpha = 0.0001\) we cannot reject \(H_0\) because 3.46 < 3.719. Therefore, the smallest level of significance where we still reject \(H_0\) is 0.001. The significance of this data, or the \(p\) value, is 0.001. If we run this analysis using SAS, SAS produces an exact \(p\) value. Our hand computations produce only an approximate value (in fact, the exact \(p\) value is between 0.0001 and 0.001). To reflect the idea that this is an approximate \(p\) value, sometimes the value is reported as \(p < 0.001\).
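The hand procedure above, stepping through tabled significance levels, can be compared with a direct computation of the upper-tail area. The Python sketch below is an illustration, not the SAS analysis the text refers to; the values of \(\sigma\) and n are the ones given earlier for this test.

```python
from math import sqrt
from statistics import NormalDist

xbar, mu0, sigma, n = 135, 130, 15, 108
z = NormalDist()                               # standard normal distribution

z_stat = (xbar - mu0) / (sigma / sqrt(n))      # test statistic, about 3.46
p_value = 1 - z.cdf(z_stat)                    # exact upper-tail p value

# Hand method: check each tabled alpha to see whether H0 is still rejected.
decisions = {}
for alpha in (0.05, 0.025, 0.01, 0.005, 0.001, 0.0001):
    critical = z.inv_cdf(1 - alpha)            # upper-tailed critical value
    decisions[alpha] = z_stat >= critical
    print(alpha, round(critical, 3), decisions[alpha])

print(round(z_stat, 2), p_value)
```

The loop rejects at every level down to 0.001 and fails to reject at 0.0001, and the directly computed p value falls between those two levels, in line with reporting p < 0.001.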

5.2.3 Power and Sample Size Determination

There are two types of errors that can be committed in hypothesis testing, a Type I error (i.e., reject \(H_0\) when \(H_0\) is true), or a Type II error (i.e., do not reject \(H_0\) when \(H_0\) is false). In Section 5.2.1 we introduced \(\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0 \text{ true})\) and \(\beta = P(\text{Type II error}) = P(\text{Do not reject } H_0 \mid H_0 \text{ false})\). In each test of hypothesis, we specify \(\alpha\), purposely choosing small values (e.g., 0.01, 0.02, 0.05, or 0.10) so that the P(Type I error) is controlled. The probability of a Type II error, \(\beta\), is difficult to control because it depends on several factors. In fact, one of the factors on which \(\beta\) depends is \(\alpha\): \(\beta\) decreases as \(\alpha\) increases. Therefore, one must weigh the choice of a lower \(\beta = P(\text{Type II error})\), which is desirable, against a higher level of significance, which is undesirable. In hypothesis testing, we are concerned with the power of a test, defined as \(1 - \beta\). The power of a test is its ability to "detect" or reject a false null hypothesis. Specifically, power is defined as

\[ \text{Power} = 1 - \beta = P(\text{Reject } H_0 \mid H_0 \text{ false}) \qquad (5.16) \]

As power increases, \(\beta\) decreases, resulting in a better test. Power (and therefore \(\beta\)) is a complicated function of three components:

1. \(n\) = sample size
2. \(\alpha\) = level of significance = P(Type I error)
3. ES = the effect size = the standardized difference in means specified under \(H_0\) and \(H_1\)

The power of a particular test is higher (or better) with a larger sample size, a larger level of significance, and a larger effect size. In this section, we introduce the concept of statistical power as it applies to the one-sample test of hypothesis about \(\mu\) and present a simple application. Suppose we are interested in the following test.

\(H_0: \mu = 100\)
\(H_1: \mu > 100\), \(\alpha = 0.05\)

To conduct the test, we select a random sample of subjects from the population of interest and analyze summary statistics, in particular \(\bar{X}\). Under the null hypothesis (i.e., if \(\mu = 100\)), the distribution of sample means is as follows.

[Figure: distribution of \(\bar{X}\) under \(H_0\)]

Assume that \(\sigma_{\bar{X}} = 6\). Suppose we want to determine the power of the test if the true mean is 110. The following displays the distributions of the sample mean under the null and alternative hypotheses.

[Figure: distributions of \(\bar{X}\) under \(H_0\) and under \(H_1\)]

We now add \(\alpha = P(\text{Type I error}) = P(\text{Reject } H_0 \mid H_0 \text{ true})\), the corresponding critical value, \(\beta = P(\text{Type II error}) = P(\text{Do not reject } H_0 \mid H_0 \text{ false})\), and power \(= 1 - \beta = P(\text{Reject } H_0 \mid H_0 \text{ false})\) to the figure, showing the distributions of the sample mean under \(H_0\) (\(\mu = 100\)) and \(H_1\) [...]

\[ \text{Power} = P\left(Z > Z_{1-(\alpha/2)} - \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right) = P\left(Z > 1.96 - \frac{|80 - 85|}{9.5/\sqrt{20}}\right) = P(Z > 1.96 - 2.36) = P(Z > -0.40) = 1 - 0.3446 = 0.6554 \]

There is a 65% probability that this test will detect a difference of 5 units in means with n = 20 at \(\alpha = 0.05\). The power is the same if \(\mu_1 = 75\). What if the sample size is increased to n = 50?

\[ \text{Power} = P\left(Z > Z_{1-(\alpha/2)} - \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right) = P\left(Z > 1.96 - \frac{|80 - 85|}{9.5/\sqrt{50}}\right) = P(Z > 1.96 - 3.72) = P(Z > -1.76) = 1 - 0.0392 = 0.9608 \]

There is a 96% probability that this test will detect a difference of 5 units in means with n = 50 at \(\alpha = 0.05\). Suppose we go back to n = 20 and consider a 3-point difference in means (i.e., \(\mu_1 = 83\)):

\[ \text{Power} = P\left(Z > Z_{1-(\alpha/2)} - \frac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right) = P\left(Z > 1.96 - \frac{|80 - 83|}{9.5/\sqrt{20}}\right) = P(Z > 1.96 - 1.41) = P(Z > 0.55) = 1 - 0.7088 = 0.2912 \]

There is only a 29% probability that this test will detect a difference of 3 units in means (in either direction) with n = 20 at \(\alpha = 0.05\). This is a small difference in means and with a small sample size we are not very likely to detect it. A larger sample would be required to ensure a higher probability of detecting such a difference.

In the following we describe techniques for determining the sample size required to ensure a specified power. In many applications, the number of subjects that can be sampled depends on financial and/or time constraints. In other cases, the investigators can choose a sample large enough to ensure a certain level of power. As described in Section 5.1.3, techniques from experimental design can be employed to determine the number of subjects required to achieve a certain level of power prior to mounting the study.

The sample size required to ensure a specific level of power in a two-sided test is

\[ n = \left(\frac{Z_{1-(\alpha/2)} + Z_{1-\beta}}{ES}\right)^2 \qquad (5.19) \]

where \(Z_{1-(\alpha/2)}\) is the value from the standard normal distribution with lower-tail area equal to \(1 - \alpha/2\),
\(Z_{1-\beta}\) is the value from the standard normal distribution with lower-tail area equal to \(1 - \beta\),
ES is the effect size, defined as the standardized difference in means under the null and alternative hypotheses (5.20):

\[ ES = \frac{|\mu_1 - \mu_0|}{\sigma} \qquad (5.20) \]

where \(\mu_0\) is the mean specified in \(H_0\),
\(\mu_1\) is the mean specified in \(H_1\),
\(\sigma\) is the population standard deviation of the characteristic under investigation.
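The three hand calculations of power earlier in this section can be checked numerically. The Python sketch below is an illustration, not the authors' program; it implements the same expression used in those calculations, \(P(Z > Z_{1-(\alpha/2)} - |\mu_0 - \mu_1|/(\sigma/\sqrt{n}))\), and the function name is hypothetical. Results differ slightly from the hand values because the tables round Z to two decimal places.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Power of a two-sided Z test using the approximation from the text:
    P(Z > z_crit - |mu0 - mu1| / (sigma / sqrt(n)))."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)              # 1.96 when alpha = 0.05
    shift = abs(mu0 - mu1) / (sigma / sqrt(n))     # standardized difference
    return 1 - z.cdf(z_crit - shift)

print(round(power_two_sided(80, 85, 9.5, 20), 3))  # 5-unit difference, n = 20
print(round(power_two_sided(80, 85, 9.5, 50), 3))  # 5-unit difference, n = 50
print(round(power_two_sided(80, 83, 9.5, 20), 3))  # 3-unit difference, n = 20
```

Because only the absolute difference enters the formula, `power_two_sided(80, 75, 9.5, 20)` gives the same result as the first call, which matches the remark that the power is the same for a mean of 75.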

The sample size required to ensure a specific level of power in a one-sided test is

\[ n = \left(\frac{Z_{1-\alpha} + Z_{1-\beta}}{ES}\right)^2 \qquad (5.21) \]

where \(Z_{1-\alpha}\) is the value from the standard normal distribution with lower-tail area equal to \(1 - \alpha\),
\(Z_{1-\beta}\) is the value from the standard normal distribution with lower-tail area equal to \(1 - \beta\),
ES is the effect size (5.20).

To implement formulas (5.19) and (5.21) to compute the sample size required to ensure a certain level of power to detect a specified difference in means, several inputs are required. First, we must specify the level of significance, \(\alpha\). This is usually straightforward because \(\alpha = 0.05\) is considered standard. Second, we must specify \(\beta = P(\text{Type II error})\). In many experimental design applications, \(\beta\) is set to 0.20, which reflects 80% power. With \(\beta = 0.20\), there is an 80% chance of rejecting a false null hypothesis relative to a specific effect size. (In some instances, \(\beta\) is set to 0.10, which reflects 90% power.) The third input, the effect size, is the most difficult to specify. The effect size reflects the magnitude of a clinically important difference in means. The effect size is best determined by an expert in the substantive area under investigation. In order to compute the effect size, we also need to quantify the variation in the characteristic under investigation (\(\sigma\)). If no such value exists, the same options outlined in Section 5.1.3 can be used to determine a reasonable approximation. The following example illustrates the computations.
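Formulas (5.19), (5.20), and (5.21) translate directly into a few lines of code. The following is a minimal Python sketch, not the SAS program used in the book; the function names and the `two_sided` flag are hypothetical, and the inputs (means of 100 and 105, \(\sigma = 9.5\)) are the ones used in the example that follows in the text.

```python
from math import ceil
from statistics import NormalDist

def effect_size(mu0, mu1, sigma):
    """Standardized difference in means, formula (5.20)."""
    return abs(mu1 - mu0) / sigma

def sample_size(es, alpha=0.05, power=0.80, two_sided=True):
    """Subjects needed for the requested power: formula (5.19) for a
    two-sided test, formula (5.21) for a one-sided test."""
    z = NormalDist()
    tail = alpha / 2 if two_sided else alpha
    z_alpha = z.inv_cdf(1 - tail)     # lower-tail area 1 - alpha/2 or 1 - alpha
    z_beta = z.inv_cdf(power)         # lower-tail area 1 - beta
    return ceil(((z_alpha + z_beta) / es) ** 2)   # always round up

es = effect_size(100, 105, 9.5)
print(round(es, 3))                                # 0.526
print(sample_size(es), sample_size(es, two_sided=False))
```

Rounding up with `ceil` reflects the rule that a fractional sample size is always taken to the next integer, so that the achieved power is at least the requested power.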

Example 5.12 Power in Test of Hypothesis for Mean

Suppose we wish to conduct the following two-sided test at a 5% level of significance:

\(H_0: \mu = 100\)
\(H_1: \mu \ne 100\)

Suppose that a difference of 5 units in the mean score is considered a clinically meaningful difference. If the true mean is less than 95 or greater than 105, we do not want to fail to reject the null hypothesis. How many subjects would be required to ensure that the probability of detecting a 5-unit difference is 80% (i.e., power = 0.80)? For this example, suppose we know that \(\sigma = 9.5\).

Because we wish to conduct a two-sided test, formula (5.19) is appropriate. First, we compute the effect size (5.20) by substituting the mean specified in \(H_0\) (\(\mu_0 = 100\)), the mean we wish to detect when \(H_0\) is false (here we could use either \(\mu_1 = 95\) or \(\mu_1 = 105\), both produce the same result), and the standard deviation \(\sigma = 9.5\):

\[ ES = \frac{|105 - 100|}{9.5} = 0.526 \]

We now substitute the effect size into formula (5.19):

\[ n = \left(\frac{Z_{0.975} + Z_{0.80}}{0.526}\right)^2 \]

We use the standard normal distribution table (Table B.2) to determine \(Z_{0.975}\) and \(Z_{0.80}\). By definition, \(Z_{0.975}\) is the Z value that holds 0.975 below it (or 0.025 above it, in the upper tail) and \(Z_{0.80}\) is the Z value that holds 0.80 below it in the standard normal distribution, shown in the next figure:

[Figure: standard normal distribution marking \(Z_{0.80}\), with area \(1 - \beta = 0.80\) below it, and \(Z_{0.975}\)]

Using Table B.2 and the techniques described in Chapter 3, we determine \(Z_{0.975} = 1.96\) and \(Z_{0.80} = 0.84\). We now substitute these values:

\[ n = \left(\frac{1.96 + 0.84}{0.526}\right)^2 = (5.323)^2 = 28.33 \]

A sample of size 29 (again, we always round up to the next integer) will ensure that power = 0.80 (or 80%, \(\beta = 0.20\)) to detect a 5-point difference in means. If the true mean is at least 5 units different from 100, there is an 80% chance that the test will lead to rejection of the null hypothesis \(H_0: \mu = 100\).

How many subjects would be required to ensure a power of 80% to detect a difference of 3 units? This and other scenarios can be investigated by substituting into the formulas just shown or by using SAS.

SAS Example 5.12 Power in Test of Hypothesis for Mean Using SAS

The following output was generated using SAS to determine the sample size required to ensure a specified power (we considered scenarios with 80% and 90% power) and differences in means of 5 and 3 units. A brief interpretation appears after the output.

SAS Output for Example 5.12

Obs   alpha   beta   mu0   mu1   sigma   power   es        n_2   n_1
1     0.05    0.2    100   105   9.5     0.8     0.52632   29    23
2     0.05    0.1    100   105   9.5     0.9     0.52632   38    31
3     0.05    0.2    100   103   9.5     0.8     0.31579   79    62
4     0.05    0.1    100   103   9.5     0.9     0.31579   106   86

Interpretation of SAS Output for Example 5.12

There is no SAS procedure specifically designed to determine the number of subjects required to detect a specific effect in the mean of a population.

However, similar to the approach taken in SAS Example 5.7, SAS can be used to program appropriate formulas. Once the formulas are implemented, users can evaluate different scenarios easily. In the example shown here, four scenarios are considered (denoted Obs 1-4, respectively). Scenario 1 corresponds to the situation presented in Example 5.12. Five variables are input into the program: the level of significance (alpha), the power (power), the mean under the null hypothesis (mu0), the mean under the alternative hypothesis (mu1), and the standard deviation (sigma). Several variables are created in the program and the values of all variables are printed in the output. A description of the variables and an interpretation of results follows.

In scenario 1 (Obs = 1), the level of significance (alpha) is set at 0.05 (5%) and the power was specified at 0.80 (80%); the probability of Type II error, \(\beta\), is computed to be 0.20 (20%); the mean under the null hypothesis (mu0) was specified as 100; the mean under the alternative hypothesis was specified as 105 (mu1); and the standard deviation (sigma) was specified at 9.5. The effect size (es) was computed by dividing the absolute value of the difference in means under the null and alternative hypothesis by the standard deviation. In scenario 1 (Obs = 1), the effect size is 0.52632. Twenty-nine subjects (n_2) are required to ensure that the probability of detecting a 5-unit difference in means is 80% (i.e., power = 0.80), with a two-sided level of significance of 5%. Twenty-three subjects (n_1) are required to ensure that the probability of detecting a 5-unit difference in means is 80%, with a one-sided level of significance of 5%.

In scenario 2 (Obs = 2), we increase the power to 0.90 (90%). Thirty-eight subjects (n_2) are required to ensure that the probability of detecting a 5-unit difference in means is 90%, with a two-sided level of significance of 5%. Thirty-one subjects (n_1) are required to ensure that the probability of detecting a 5-unit difference in means is 90%, with a one-sided level of significance of 5%.

In scenario 3 (Obs = 3), we input a power of 0.80 (80%) but decrease the mean under the alternative hypothesis (mu1) to 103, which decreases the effect size (es) to 0.31579. Seventy-nine subjects (n_2) are required to ensure that the probability of detecting a 3-unit difference in means is 80%, with a two-sided test and 5% level of significance. Sixty-two subjects (n_1) are required to ensure that the probability of detecting a 3-unit difference in means is 80%, with a one-sided test and 5% level of significance.

In scenario 4 (Obs = 4) we input a power of 0.90 (90%) and consider the mean under the alternative hypothesis (mu1) as 103. One hundred six subjects (n_2) are required to ensure that the probability of detecting a 3-unit difference in means is 90%, with a two-sided test and 5% level of significance. Eighty-six subjects (n_1) are required to ensure that the probability of detecting a 3-unit difference in means is 90%, with a one-sided test and 5% level of significance.

Notice that more subjects are required to ensure a higher power. More subjects are also required to detect a smaller effect size. These results would be weighed against practical constraints to determine the most appropriate sample size for the application.

5.3

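Key Formulas

Application: Confidence interval estimate for \(\mu\)
Notation/Formula: \(\bar{X} \pm Z_{1-(\alpha/2)} \dfrac{\sigma}{\sqrt{n}}\)
Description: See Table 5.2 for alternate formulas

Application: Find n to estimate \(\mu\)
Notation/Formula: \(n = \left(\dfrac{Z_{1-(\alpha/2)}\,\sigma}{E}\right)^2\)
Description: Sample size to ensure margin of error E with confidence level reflected in Z (find Z in Table 5.1)

Application: Test statistic for \(H_0: \mu = \mu_0\)
Notation/Formula: \(Z = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}\)
Description: See Table 5.3 for hypothesis testing procedure, Table 5.4 for critical values of Z, and Table 5.5 for alternate formulas

Application: Find p value
Notation/Formula: [...]
Description: Reject \(H_0\) if [...]

Application: Find power for test of \(H_0: \mu = \mu_0\)
Notation/Formula: \(\text{Power} = P\left(Z > Z_{1-(\alpha/2)} - \dfrac{|\mu_0 - \mu_1|}{\sigma/\sqrt{n}}\right)\)
Description: Power of two-sided test for mean

Application: Find n to test \(H_0: \mu = \mu_0\)
Notation/Formula: \(n = \left(\dfrac{Z_{1-(\alpha/2)} + Z_{1-\beta}}{ES}\right)^2\), where \(ES = \dfrac{|\mu_1 - \mu_0|}{\sigma}\)
Description: Sample size [...]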

;

2l6

;

Chapter

;

;

Procedures for

5 Statistical Inference:

\x
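The sample size and power calculations for the one-sample test can be checked numerically. The following is a minimal Python sketch, not the SAS program used in this chapter. It assumes the usual normal-approximation formulas, $n = ((Z_{1-\alpha/2} + Z_{1-\beta})/ES)^2$ with $ES = |\mu_0 - \mu_1|/\sigma$, and Power $= P(Z > Z_{1-\alpha/2} - |\mu_0 - \mu_1|/(\sigma/\sqrt{n}))$. The function names are illustrative only.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sample_size_one_mean(mu0, mu1, sigma, alpha=0.05, power=0.80, two_sided=True):
    """Smallest n for a Z test of H0: mu = mu0 to reach the stated power at mu1."""
    effect_size = abs(mu1 - mu0) / sigma
    tail_area = alpha / 2 if two_sided else alpha
    z_alpha = STD_NORMAL.inv_cdf(1 - tail_area)
    z_beta = STD_NORMAL.inv_cdf(power)
    # Round up: a fractional subject is not possible.
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


def power_two_sided(mu0, mu1, sigma, n, alpha=0.05):
    """Approximate power of the two-sided Z test (the far rejection tail is ignored)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = abs(mu1 - mu0) / (sigma / math.sqrt(n))
    return 1 - STD_NORMAL.cdf(z_alpha - shift)


if __name__ == "__main__":
    # Scenarios with mu0 = 100, mu1 = 103, sigma = 9.5 (a 3-unit difference).
    for target_power in (0.80, 0.90):
        n_two = sample_size_one_mean(100, 103, 9.5, power=target_power)
        n_one = sample_size_one_mean(100, 103, 9.5, power=target_power, two_sided=False)
        print(f"power={target_power:.2f}: two-sided n={n_two}, one-sided n={n_one}")
    print(f"power with n=79: {power_two_sided(100, 103, 9.5, 79):.3f}")
```

With a 3-unit difference and a standard deviation of 9.5, the sketch gives 79 and 62 subjects for 80% power and 106 and 86 subjects for 90% power (two-sided and one-sided tests, respectively).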

Program Code                                 Description

options ps=62 ls=80;                         Formats the output page to 62 lines in length and 80 columns in width.
data in;                                     Beginning of Data Step.
  input salary;                              Inputs variable salary.
  teststat=salary-29.5;                      Creates a new variable, called teststat, by subtracting 29.5 (mu_0) from each salary.
  label salary = 'Annual Salary in $000s';   Attaches a descriptive label to salary.
cards;                                       Beginning of Raw Data section.
32                                           Actual observations.
27
31
27
26
26
30
22
25
36
run;
proc means n mean std stderr t prt;          Procedure call. Proc Means generates summary statistics for continuous variables. Certain statistics are requested, including t statistics and p values for conducting the t tests of hypothesis (see Interpretation).
  var salary teststat;                       Specification of variables.
run;                                         End of procedure section.
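As a cross-check on the Proc Means request above, the same summary statistics and t statistic can be sketched in Python. This is an illustration only, not part of the SAS program. It assumes the ten salaries shown in the data listing are the complete sample, and the function name `one_sample_t` is hypothetical. The two-sided p value (the `prt` statistic in SAS) requires the t distribution, which is not in the Python standard library, so it is omitted here.

```python
import math
from statistics import mean, stdev

# Annual salaries in $000s, as read from the data listing (assumed complete).
salaries = [32, 27, 31, 27, 26, 26, 30, 22, 25, 36]
MU0 = 29.5  # hypothesized mean, the value subtracted to create teststat


def one_sample_t(values, mu0):
    """Return (n, mean, sd, standard error, t) for a test of H0: mu = mu0."""
    n = len(values)
    xbar = mean(values)
    sd = stdev(values)              # sample standard deviation (n - 1 divisor)
    std_err = sd / math.sqrt(n)     # standard error of the mean
    t_stat = (xbar - mu0) / std_err
    return n, xbar, sd, std_err, t_stat


n, xbar, sd, std_err, t_stat = one_sample_t(salaries, MU0)
print(f"n={n} mean={xbar:.2f} sd={sd:.4f} stderr={std_err:.4f} t={t_stat:.4f}")
```

The t statistic computed this way is the one SAS reports for teststat, because subtracting 29.5 from every salary shifts the mean but leaves the standard deviation unchanged.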

SAS Example 5.12 Determine the Number of Subjects Required to Detect a Specific Effect Size in a Test of Hypothesis About $\mu$ (Example 5.12)

Sample Size Requirements

We wish to conduct the following test at the 5% level of significance: $H_0: \mu = 100$ vs. $H_1: \mu \neq 100$. How many subjects would be required to ensure that the probability of detecting a 5- (or a 3-) unit difference is 80% (i.e., power = 0.80)? Also consider scenarios with 90% power and assume that $\sigma = 9.5$.

Program Code                          Description

options ps=62 ls=80;                  Formats the output page to 62 lines in length and 80 columns in width.
data in;                              Beginning of Data Step.
  input alpha power mu0 mu1 sigma;    Inputs 5 variables alpha, power, mu0, mu1, and sigma.
  z_alpha2=probit(1-alpha/2);         Determines the value from the standard normal distribution with lower-tail area 1-alpha/2 (see $Z_{1-\alpha/2}$ above).
  z_alpha1=probit(1-alpha);           Determines the value from the standard normal distribution with lower-tail area 1-alpha (see $Z_{1-\alpha}$ above).
  beta=1-power;                       Computes beta.
  z_beta=probit(1-beta);              Determines the value from the standard normal distribution with lower-tail area 1-beta.

\c means n

alpha

=

(

Description

all

mean std nun max dm;

generates a 100(1

mean

conducts

0.05

./.;

confidence interval tor

/
(Y\>

=

... adherence among all HIV-infected patients new to medication therapy (initiated within the past month). Adherence is measured as the percent of prescribed doses taken (e.g., 100% = perfect adherence, 50% = half of prescribed doses taken). A random sample of 75 HIV-infected patients new to medication therapy agree to participate in the study. Each reports their medication regimen and the doses they took over the past month. The mean percent adherence is 78% with a standard deviation of 7.2%. Construct a 95% confidence interval estimate of the mean percent adherence for all HIV-infected patients new to medication therapy.

21. The mean lung capacity for nonsmoking males aged 50 is 2 liters. An investigator wants to examine if the mean lung capacities are significantly lower among former smokers of similar backgrounds (i.e., males aged 50 who smoked in the past and are not currently smokers). A random sample of 60 former smokers is selected. Their lung capacities have a mean of 1.8 liters with a standard deviation of 0.27 liter. Run the appropriate test at a 5% level of significance.

22. Among private universities in the United States, the mean ratio of students to professors is 35.2 (i.e., 35.2 students for each professor) with a standard deviation of 8.8.
a. What is the probability that in a random sample of 50 private universities the mean student-to-professor ratio exceeds 38?
b. Suppose a random sample of 50 universities is selected and the observed mean student-to-professor ratio is 38. Is there evidence that the reported mean ratio actually exceeds 35.2? Use $\alpha = 0.05$.

23. The recommended daily allowance (RDA) of iron for adult females under the age of 51 is 18 mg. We wish to test if females under age 51 are, on average, getting less than 18 mg. A random sample of 48 females between the ages of 18 and 50 is selected. The average iron intake is 16.4 mg with a standard deviation of 4.1 mg. Run the appropriate test at the 5% level of significance.

24. If a statistical test is performed and $H_0$ is rejected at $\alpha = 0.01$, will it also be rejected at $\alpha = 0.05$?

25. A journal article reported that the mean hospital stay following a particular surgical procedure in 2001 was 7.1 days. A researcher feels that the mean hospital stay in 2002 should be less due to initiatives aimed at reducing health care costs. A random sample of 40 patients undergoing the same surgical procedure in 2002 had a mean length of stay of 6.85 days with a standard deviation of 7.01 days. Run the appropriate statistical test at $\alpha = 0.05$.

26. A consumer group is investigating a producer of diet meals to examine if its prepackaged meals actually contain the advertised 6 ounces of protein in each package. Based on the following data, is there any evidence that the meals do not contain the advertised amount of protein? Run the appropriate test at a 5% level of significance.

5.1  4.9  6.0  5.1  5.7  5.5  4.9  6.1  6.0  5.8
5.2  4.8  4.7  4.2  4.9  5.5  5.6  5.8  6.0  6.1

27. An article reported that patients under care for HIV have CD4 tests every 3 months, on average. A concern at Boston Medical Center is that there is a longer lag between tests. To test the concern, a random sample of 15 patients currently under care for HIV is selected and the time between their two most recent CD4 tests is recorded. The mean time between tests is 3.9 months with a standard deviation of 0.4 month. Run the appropriate test at a 5% level of significance.

28. A study reports that the mean systolic blood pressure for patients with a history of cardiovascular disease is 125 with a standard deviation of 15. We wish to design a study to evaluate an experimental medication for reducing blood pressure. How many subjects would be required to detect a 10-unit reduction in systolic blood pressure with 80% power? Assume that a two-sided test will be run at a 5% significance level.

29. We wish to test the hypothesis that the mean weight for females who are 5'8" is 140 pounds. Assuming $\sigma = 15$, using a 5% level of significance and with n = 36, find the power of the test if $\mu = 150$. (Use a two-sided test of hypothesis.)

30. We wish to run the following test: $H_0: \mu = 100$ versus $H_1: \mu \neq 100$ at $\alpha = 0.05$. If $\sigma = 10$, how large a sample would be required so that $\beta = 0.04$ if $\mu = 110$?

31. In a normal population with $\sigma = 5$, we wish to test $H_0: \mu = 12$ versus $H_1: \mu \neq 12$ at $\alpha = 0.05$. With a sample of 64 subjects, what is the probability of rejecting $H_0$ if $\mu = 14$? If $\mu = 9$?

32. Results of an industry survey in the computer software field find that the mean number of sick days taken by employees is 9.4 per year with a standard deviation of 2.7 per year. A local computer software company feels its employees take significantly fewer sick days per year. A random sample of 15 employees is selected from the local company and attendance records are reviewed. The following data represent the numbers of sick days taken by these employees over the past year:

8  10  5  5  4  3  9  5  4  15  6  2  15

Run the appropriate test at a 5% level of significance.

33. An analysis is conducted to compare the mean GRE scores among seniors in a local university to the national average of 500. Use the SAS output shown to address the following questions.

Variable    N     Mean          Std Dev      Std Error    T        Prob>|T|
GRE         250   512.0463595   86.2844894   5.4571103    93.835   0.0001
TESTSTAT    250   12.0463595    86.2844894   5.4571103    2.207    0.0282

a. Is the mean GRE score among all the local seniors significantly different from the national average? Justify briefly (show all parts of the test).
b. Can we say that the mean GRE score among all the local seniors is significantly higher than the national average? Support your conclusion with data (show all parts of the test).

34. An academic medical center surveyed all of its patients in 2002 to assess their satisfaction with medical care. Satisfaction was measured on a scale of 0 to 100, with higher scores indicative of more satisfaction. The mean satisfaction score in 2002 was 84.5. Several quality-improvement initiatives were implemented in 2003 and the medical center is wondering whether the initiatives increased patient satisfaction. A random sample of 125 patients seeking medical care in 2003 was surveyed using the same satisfaction measure. Their mean satisfaction score was 89.2 with a standard deviation of 17.4. Is there evidence of a significant improvement in satisfaction? Run the appropriate test at the 5% level of significance.

SAS Problems

Use SAS to solve each of the following problems.

1. A study is conducted to assess the extent to which patients who had coronary artery bypass surgery were maintaining their prescribed exercise programs. The following data reflect the numbers of times patients reported exercising over the previous month (4 weeks). For the purposes of this study, exercise was defined as moderate physical activity lasting at least 20 minutes in duration.

14  11  6  13  12  8  3  14

Use SAS Proc Means to generate summary statistics on the numbers of times patients exercised over the previous month and a 95% confidence interval estimate for the mean number of times patients exercised following surgery.

2. We wish to design a study to estimate the mean of a population. We wish to consider several scenarios and to estimate the required sample size for each. Use SAS to determine the sample sizes required for each scenario. Consider margins of error of 5, 10, and 20; confidence levels of 90% and 95%; and standard deviations of 55 and 65 (for a total of 3 x 2 x 2 = 12 scenarios).

3. The following data were collected from a random sample of 10 asthmatic children enrolled in a research study and reflect the number of days each child missed school during the past 3 months:

12  6  14  3  2  7  4  10  8  6

Use SAS Proc Means to generate summary statistics and request a 95% confidence interval for the mean.

4. A consumer group is investigating a producer of diet meals to examine if its prepackaged meals actually contain the advertised 6 ounces of protein in each package. The group collected the following data:

5.1  4.9  6.0  5.1  5.7  5.5  4.9  6.1  6.0  5.8
5.2  4.8  4.7  4.2  4.9  5.5  5.6  5.8  6.0  6.1

Use SAS Proc Means to generate summary statistics on the ounces of protein contained in the packaged meals. In addition, run a test to determine if there is any evidence that the meals do not contain the advertised amount of protein. Run the appropriate test at a 5% level of significance.

5. We wish to design a study to test the following hypotheses: $H_0: \mu = 100$ versus $H_1: \mu \neq 100$. We wish to consider several scenarios and to estimate the required sample size for each. Use SAS to determine the sample sizes required for each scenario to ensure power = 80%. Consider means under the alternative hypothesis of 90, 95, and 120; levels of significance of 0.05 and 0.01; and standard deviations of 7 and 10 (for a total of 3 x 2 x 2 = 12 scenarios).

6. Results of an industry survey in the computer software field find that the mean number of sick days taken by employees is 9.4 per year with a standard deviation of 2.7 per year. A local computer software company feels its employees take significantly fewer sick days per year. A random sample of 15 employees is selected from the local company and attendance records are reviewed. The following data represent the numbers of sick days taken by these employees over the past year:

8  10  5  5  4  3  6  2  9  5  4  15  15

Use SAS Proc Means to generate summary statistics on the numbers of sick days taken by employees. In addition, run a test to determine if there is any evidence that the employees take fewer than 9.4 sick days per year. Run the appropriate test at a 5% level of significance. (Note: SAS performs a two-sided test; make the adjustment to draw your conclusion.)

[Figure: chapter-opening flowchart. Descriptive Statistics (Ch. 2), Probability (Ch. 3), and Sampling Distributions (Ch. 4) lead to Statistical Inference (Chapters 5-13), where the appropriate procedure is selected by outcome variable (continuous, dichotomous, discrete, or time to event) and by grouping variable or predictors.]

Statistical Inference: Procedures for $(\mu_1 - \mu_2)$

6.1 Statistical Inference Concerning $(\mu_1 - \mu_2)$
6.2 Power and Sample Size Determination
6.3 Key Formulas
6.4 Statistical Computing
6.5 Analysis of Framingham Heart Study Data
6.6 Problems

We now describe statistical inference procedures when there are two comparison groups and the outcome of interest is a continuous variable. In such cases, we compare means between groups. In Chapter 5 we described statistical inference procedures for a single sample (estimation of an unknown mean and tests of hypothesis about the mean of the population).

Two sample applications are extremely common. For example, recall Example 5.9 in which a diet was evaluated for its ability to reduce cholesterol levels. In the example, a single sample of 12 individuals followed the diet for 3 months. At the end of the 3-month observation period, cholesterol levels were measured and compared against a known (or historical) value. In that example, we did not find statistically significant evidence of a reduction in cholesterol attributable to the diet. We made the assumption that the mean cholesterol level for males aged 50 not following the diet was 241, and we observed a mean cholesterol level in our sample of 235. The assumption that the mean cholesterol level would be 241 in persons not following the diet may or may not have been a valid assumption.

We could have used different study designs that might have given a better assessment of the impact of the diet on cholesterol. For example, we could have used a concurrent comparison group (instead of a historical comparison). This type of study involves selecting a group of individuals appropriate for the study (e.g., males aged 50) and randomly assigning them to one of two groups. (Later we will describe in detail the procedures for assigning individuals at random to comparison groups.) One group of individuals follows the diet while the comparison group does not. At the end of 3 months, we compare the mean cholesterol levels between comparison groups. If the groups are similar (and this is related to random assignment), except that one group followed the diet and the other did not, then differences in cholesterol can be attributed to the diet. This is an example of what we call a two independent samples procedure, in which the two comparison groups are physically distinct (i.e., they are composed of different individuals).

Another study design for the assessment of the effect of the diet on cholesterol involves the 12 subjects, but before starting the diet, we measure their initial (sometimes called baseline) cholesterol levels. After each individual follows the diet for 3 months, we then measure a final cholesterol level. The focus in this design is how much each individual changes over time. If individuals' cholesterol levels drop from where they started, we conclude that the diet is effective in reducing cholesterol. This is an example of what we call a two dependent samples procedure, in which each individual serves as his or her own control.

In this chapter, we will describe two independent and two dependent samples procedures in detail. The most appropriate design for a specific application depends on a variety of factors, including the treatment and outcome under investigation and characteristics of the study subjects. These details will be discussed in subsequent chapters.

The techniques described here are concerned with the difference between two means. The techniques for estimating the difference between two means as well as the techniques for testing if two means are significantly different (or if one is larger than the other) are identical in principle to the techniques described in Chapter 5, which were concerned with the mean of a single population ($\mu$). The assumptions necessary for valid applications of the techniques and formulas that follow are

1. random samples from the populations under consideration
2. large samples ($n_i > 30$, where $i = 1, 2$) or normal populations

In two independent samples procedures, the parameter of interest is the difference in population means: $(\mu_1 - \mu_2)$. Confidence intervals in two independent samples applications are concerned with estimating $(\mu_1 - \mu_2)$, the difference in means, as opposed to the value of either mean as was the case in the one-sample estimation problems. The same is true in tests of hypotheses in the two-sample case. Both the null and research or alternative hypothesis are concerned with the difference in means, for example, $H_0: \mu_1 - \mu_2 = 0$ (no difference in means) versus $H_1: \mu_1 - \mu_2 \neq 0$ (means are different). In two dependent samples procedures, the parameter of interest is the mean difference: $\mu_d$.

1

persons following

6.1 Procedures

a

a

comparison of the mean cholesterol

special diet as

compared

to persons taking

Concerning^ 2

Iwo independent populations and —

(2.67

2.67 2

r

H

r

i

M

+ 1)* 2

+ T

fl

0.62

l

9

Using the t distribution table (Table B.3), the two-sided value with 22 degrees of freedom is t = 2.074. The decision rule is

Reject $H_0$ if $t \geq 2.074$ or if $t \leq -2.074$
Do not reject $H_0$ if $-2.074 < t < 2.074$

We have significant evidence, $\alpha = 0.05$, to show a difference in the mean heart rates following 30 minutes of aerobic exercise for females aged 20-24 years as compared to females aged 25-30 years. For this example, p = 0.02.

We now illustrate the preliminary test for homogeneity of variances followed by a test of hypothesis concerning means of two independent populations.

Example 6.6 Testing Difference in Mean Public Health Awareness Scores Between Males and Females

Random samples of 11 male high school students and 12 female high school students are selected within a particular school district for an investigation. Students' scores on a public health (PH) awareness test are recorded; the descriptive statistics follow. The test is scored on a scale of 0-1000, with higher scores indicative of more awareness.

Statistic                                    Males     Females
Sample size                                  11        12
Mean PH awareness score                      560.0     554.2
Standard deviation in PH awareness scores    133.1     129.4

Test if the male students score significantly higher than the female students on the public health awareness test within this school district using a 5% level of significance.

1. Set up hypotheses.

$H_0: \mu_1 = \mu_2$
$H_1: \mu_1 > \mu_2$, $\alpha = 0.05$

where $\mu_1$ = mean score for males and $\mu_2$ = mean score for females.

2. Select the appropriate test statistic.

In order to determine if this application falls into Case 2 or Case 3, a preliminary test of the equality of population variances must be conducted.

   1. Set up hypotheses.
      $H_0: \sigma_1^2 = \sigma_2^2$
      $H_1: \sigma_1^2 \neq \sigma_2^2$, $\alpha = 0.05$
   2. Select the appropriate test statistic.
      $F = \dfrac{s_1^2}{s_2^2}$
   3. Decision rule.
      $df_1 = n_1 - 1 = 11 - 1 = 10$ (numerator degrees of freedom)
      $df_2 = n_2 - 1 = 12 - 1 = 11$ (denominator degrees of freedom)
      $F_{.975}(10, 11) = 3.53$ and $F_{.975}(11, 10) = 3.72$
      Reject $H_0$ if $F \leq 1/3.72 = 0.269$ or if $F \geq 3.53$
      Do not reject $H_0$ if $0.269 < F < 3.53$
   4. Test statistic.
      $F = \dfrac{s_1^2}{s_2^2} = \dfrac{(133.1)^2}{(129.4)^2} = 1.06$
   5. Conclusion.
      Do not reject $H_0$ since $0.269 < 1.06 < 3.53$. We do not have significant evidence to show that $\sigma_1^2 \neq \sigma_2^2$.

Therefore, for the purposes of this test of means, we assume that the population variances are equal (i.e., $\sigma_1^2 = \sigma_2^2$) and apply the test statistic given under Case 2:

$t = \dfrac{\bar{X}_1 - \bar{X}_2}{S_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$, $df = n_1 + n_2 - 2 = 11 + 12 - 2 = 21$

3. Decision rule.

Reject $H_0$ if $t \geq 1.721$
Do not reject $H_0$ if $t < 1.721$

4. Test statistic.

We first compute $S_p$:

$S_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\dfrac{10(133.1)^2 + 11(129.4)^2}{11 + 12 - 2}} = 131.2$

Now,

$t = \dfrac{560.0 - 554.2}{131.2 \sqrt{\dfrac{1}{11} + \dfrac{1}{12}}} = 0.11$
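The arithmetic in Example 6.6 can be retraced with a short Python sketch. It is offered as a check on the hand calculation, not as the procedure SAS uses. It assumes group 1 is males (standard deviation 133.1) and group 2 is females (standard deviation 129.4), as in the example's F statistic, and it takes the critical values 0.269, 3.53, and 1.721 from the decision rules rather than computing them. The function names are illustrative.

```python
import math


def variance_ratio(s1, s2):
    """F statistic for the preliminary test: ratio of the sample variances."""
    return s1 ** 2 / s2 ** 2


def pooled_sd(n1, s1, n2, s2):
    """Pooled estimate of the common standard deviation (Case 2)."""
    return math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))


def pooled_t(xbar1, xbar2, sp, n1, n2):
    """Two independent samples t statistic with equal variances assumed."""
    return (xbar1 - xbar2) / (sp * math.sqrt(1 / n1 + 1 / n2))


# Group 1 = males, group 2 = females (summary statistics from Example 6.6).
n1, xbar1, s1 = 11, 560.0, 133.1
n2, xbar2, s2 = 12, 554.2, 129.4

f_stat = variance_ratio(s1, s2)
equal_variances = 0.269 < f_stat < 3.53   # do-not-reject region of the F test

sp = pooled_sd(n1, s1, n2, s2)
t_stat = pooled_t(xbar1, xbar2, sp, n1, n2)
reject_h0 = t_stat >= 1.721               # one-sided critical value, df = 21

print(f"F={f_stat:.2f} Sp={sp:.1f} t={t_stat:.2f} reject H0: {reject_h0}")
```

Because the t statistic falls well below 1.721, the sketch leads to the same decision as the two-sided SAS p value halved for a one-sided test: the null hypothesis is not rejected.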

NOTE: The two sample t tests (Case 2 and Case 3) concerning $(\mu_1 - \mu_2)$ are generally robust (i.e., insensitive to violations in assumptions such as normality and/or homogeneity of variances) when the sample sizes are equal (i.e., $n_1 = n_2$). However, the F test for equality of variances is sensitive to violations in the normality assumption. If, for example, the analytic variable is not normally distributed and the F test is applied, the actual level of significance, $\alpha$, may exceed the specified level.

SAS Example 6.6 Testing Difference in Mean Public Health Awareness Scores Between Males and Females Using SAS

The following output was generated using SAS Proc Ttest, which conducts a two independent samples test of hypothesis. The same procedure automatically produces a preliminary test of the homogeneity of variances. A brief interpretation appears after the output.

SAS Output for Example 6.6

The TTEST Procedure

Statistics

                                  Lower CL              Upper CL
Variable   Class          N       Mean       Mean       Mean
test       female         12      471.93     554.17     636.41
test       male           11      470.57     560        649.43
test       Diff (1-2)             -119.7     -5.833     108.06

                          Lower CL              Upper CL
Variable   Class          Std Dev    Std Dev    Std Dev    Std Err    Minimum    Maximum
test       female         91.692     129.44     219.77     37.365     260        750
test       male           93.011     133.12     233.61     40.136     370        770
test       Diff (1-2)     100.94     131.2      187.5      54.767

T-Tests

Variable   Method          Variances    DF      t Value    Pr > |t|
test       Pooled          Equal        21      -0.11      0.9162
test       Satterthwaite   Unequal      20.7    -0.11      0.9163

Equality of Variances

Variable   Method      Num DF    Den DF    F Value    Pr > F
test       Folded F    10        11        1.06       0.9215

Interpretation of SAS Output for Example 6.6

In the top section of the output, SAS provides summary statistics on the analytic variable (test score) for each comparison group (females and males) and then for the differences in means (females - males). The summary statistics include the sample sizes, the sample mean (Mean), and 95% confidence intervals (CI) for the population means of each group and for the difference in means (the limits of the CI for the mean are labeled "lower CL mean" and "upper CL mean"), standard deviations, and 95% confidence intervals for the population standard deviations of each group and for the differences (the limits of the CI for the standard deviation are labeled "lower CL Std Dev" and "Upper CL Std Dev"), standard errors ($s/\sqrt{n}$), minimums, and maximums.

In the next section of the output, SAS performs the test of hypothesis for equality of means. SAS actually carries out two different tests, one in which the population variances are assumed to be equal (we called this Case 2) and one in which the population variances are assumed to be unequal (we called this Case 3). SAS uses the formulas we summarized in Tables 6.3 and 6.5 for equal and unequal variances, respectively. The values of the test statistics appear under the column headed "t Value," and just before these SAS displays the degrees of freedom. Again, these are computed using the formulas from Tables 6.3 and 6.5. Finally, SAS provides two-sided p values (assuming that the alternative hypothesis is $H_1: \mu_1 \neq \mu_2$).

The user must decide which analysis (equal or unequal variances) is most appropriate. To aid in this decision, SAS provides a preliminary test of the homogeneity of variances (i.e., $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$). The results of the preliminary test appear at the bottom of the SAS Ttest output in the section titled "Equality of Variances." SAS provides an F statistic (computed by taking the ratio of the sample variances). It computes F by dividing the larger sample variance by the smaller, regardless of the group (1 or 2) designation. Therefore, the F statistic produced by SAS is always greater than or equal to 1. In this case, since the sample variance among males is larger: $F = (133.1)^2/(129.4)^2 = 1.06$. The degrees of freedom associated with F are given by $df_1 = n_1 - 1 = 11 - 1 = 10$ and $df_2 = n_2 - 1 = 12 - 1 = 11$. For the preliminary test, SAS produces the probability of observing a value of F more extreme than the value of the test statistic (denoted Pr > F); the following rule should be applied to draw a conclusion in the preliminary test: Reject $H_0$ if p value < $\alpha$. In this case, the p value, 0.9215, exceeds $\alpha = 0.05$, so $H_0$ is not rejected and this example is considered an example of Case 2.

Because the preliminary test suggested that this example is an example of Case 2 (i.e., equal variances), we look at the output for the equal variances case; the test statistic is t = -0.11 with 21 degrees of freedom ($n_1 + n_2 - 2$). For the main test, SAS produces a two-sided p value. In this example, the p value is 0.9162. Because we are interested in a one-sided test, the following rule should be applied: Reject $H_0$ if (p value/2) < $\alpha$. In this example, we do not reject $H_0$, since (p value/2) = (0.9162/2) = 0.4581 > 0.05. We do not have significant evidence to show that male students score higher than female students on the public health awareness test in this school district. (Note: The test statistic produced in SAS differs from that computed in Example 6.6 due to the fact that SAS orders the groups alphabetically and calls females group 1.)
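Two steps in the interpretation lend themselves to a small numerical check: the Satterthwaite degrees of freedom reported for the unequal variances test, and the conversion of a two-sided p value to a one-sided p value. The Python sketch below is an illustration under stated assumptions, not SAS code. It uses the standard Satterthwaite approximation, takes the standard deviations and the p value directly from the output, and uses hypothetical function names. Halving the p value is valid only when the sign of the test statistic agrees with the alternative hypothesis; otherwise the one-sided p value is 1 - (p value/2).

```python
def satterthwaite_df(s1, n1, s2, n2):
    """Approximate degrees of freedom for the unequal variances t test."""
    v1 = s1 ** 2 / n1   # squared standard error contributed by group 1
    v2 = s2 ** 2 / n2   # squared standard error contributed by group 2
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def one_sided_p(two_sided_p, t_stat, alternative):
    """Convert a two-sided p value; alternative is 'less' or 'greater' (mu1 vs mu2)."""
    agrees = t_stat < 0 if alternative == "less" else t_stat > 0
    half = two_sided_p / 2
    return half if agrees else 1 - half


# SAS calls females group 1 and males group 2, so "males score higher"
# corresponds to the alternative mu1 - mu2 < 0, i.e. 'less'.
df_unequal = satterthwaite_df(129.44, 12, 133.12, 11)
p_one_sided = one_sided_p(0.9162, -0.11, "less")

print(f"Satterthwaite df = {df_unequal:.1f}, one-sided p = {p_one_sided:.4f}")
```

The degrees of freedom agree with the 20.7 shown in the T-Tests section, and the one-sided p value matches the 0.4581 used in the conclusion.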

Example 6.7 Estimating Difference in Mean Number of Emergency Room Visits Between Children 5 and Under and 6-10 Years of Age

Suppose we wish to estimate the difference in the mean numbers of emergency room (ER) visits in 12 months among children with asthma age 5 and under as compared to children aged 6-10. For the purposes of this investigation, our analyses are restricted to children who are free from any other chronic conditions (i.e., they suffer from asthma alone). The following data are collected on random samples of 65 children age 5 and under and 50 children aged 6-10:

Mean Number




1.70. We have significant evidence to show that $\sigma_1^2 \neq \sigma_2^2$. Therefore, for the purposes of this test of means, we apply the test statistic given under Case 3:

$Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$

3. Decision rule.

Using Table B.2B, the decision rule is

Reject $H_0$ if $Z \geq 1.960$ or if $Z \leq -1.960$
Do not reject $H_0$ if $-1.960 < Z < 1.960$

4. Test statistic.

$Z = \dfrac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \dfrac{3.4 - 4.5}{\sqrt{\dfrac{4.5}{50} + \dfrac{1.9}{60}}} = \dfrac{-1.10}{0.349} = -3.15$

5. Conclusion.

Reject $H_0$ since $-3.15 < -1.960$. We have significant evidence, $\alpha = 0.05$, to show that there is a difference in the mean number of visits to the student health center between university freshmen and sophomores. For this test, p = 0.001 (see Table B.2B).
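The Case 3 test statistic can be verified with a few lines of Python. This is a sketch for checking the arithmetic, assuming the sample variances 4.5 and 1.9 and the sample sizes 50 and 60 shown in the test statistic. The function name is illustrative, and the p value is computed from the standard normal distribution rather than read from Table B.2B, so it may differ slightly from the tabled value.

```python
import math
from statistics import NormalDist


def two_sample_z(xbar1, var1, n1, xbar2, var2, n2):
    """Z statistic for two independent samples with unequal variances (Case 3)."""
    std_err = math.sqrt(var1 / n1 + var2 / n2)
    return (xbar1 - xbar2) / std_err


# Group 1 = freshmen, group 2 = sophomores.
z_stat = two_sample_z(3.4, 4.5, 50, 4.5, 1.9, 60)

# Two-sided p value: area in both tails beyond |Z|.
p_two_sided = 2 * NormalDist().cdf(-abs(z_stat))

reject_h0 = abs(z_stat) >= 1.960
print(f"Z = {z_stat:.2f}, two-sided p = {p_two_sided:.4f}, reject H0: {reject_h0}")
```

The sketch reproduces Z = -3.15 and gives a two-sided p value between 0.001 and 0.002, in line with the conclusion of the example.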

SAS Example 6.8 Testing Difference in Mean Number of Visits to Health Center Between Freshmen and Sophomores Using SAS

The following output was generated using SAS Proc Ttest. A brief interpretation appears after the output.

SAS Output for Example 6.8

The TTEST Procedure

Statistics

                                  Lower CL              Upper CL
Variable   Class          N       Mean       Mean       Mean
visits     freshman       50      2.8046     3.4071     4.0096
visits     sophomore      60      4.1457     4.5059     4.8661
visits     Diff (1-2)             -1.767     -1.099     -0.43

                          Lower CL              Upper CL
Variable   Class          Std Dev    Std Dev    Std Dev    Std Err    Minimum    Maximum
visits     freshman       1.7709     2.12       2.6418     0.2998     -1.548     7.7675
visits     sophomore      1.1819     1.3944     1.7007     0.18       1.4388     8.0719
visits     Diff (1-2)     1.5543     1.761      2.0318     0.3372

T-Tests

Variable   Method          Variances    DF      t Value    Pr > |t|
visits     Pooled          Equal        108     -3.26      0.0015
visits     Satterthwaite   Unequal      81.9    -3.14      0.0023

Equality of Variances

Variable   Method      Num DF    Den DF    F Value    Pr > F
visits     Folded F    49        59        2.31       0.0022

Interpretation of SAS Output for Example 6.8

In the top section of the output, SAS provides summary statistics on the analytic variable (Number of Visits) for each comparison group (freshmen and sophomores) and then for the differences in means (freshmen - sophomores). The summary statistics include the sample sizes, the sample mean (Mean), and 95% confidence intervals (CI) for the population means of each group and for the difference in means (the limits of the CI for the mean are labeled "lower CL mean" and "upper CL mean"), standard deviations, and 95% confidence intervals for the population standard deviations of each group and for the differences (the limits of the CI for the standard deviation are labeled "lower CL Std Dev" and "Upper CL Std Dev"), standard errors ($s/\sqrt{n}$), minimums, and maximums.

In the next section of the output, SAS performs the test of hypothesis for equality of means. SAS actually carries out two different tests, one in which the population variances are assumed to be equal (we called this Case 2) and one in which the population variances are assumed to be unequal (we called this Case 3). SAS uses the formulas we summarized in Tables 6.3 and 6.5 for equal and unequal variances, respectively. The values of the test statistics appear under the column headed "t Value," and just before these SAS displays the degrees of freedom. Again, these are computed using the formulas from Tables 6.3 and 6.5. Finally, SAS provides two-sided p values (assuming that the alternative hypothesis is $H_1: \mu_1 \neq \mu_2$).

The user must decide which situation (equal or unequal variances) is most appropriate. To aid in this decision, SAS provides a preliminary test of the homogeneity of variances (i.e., $H_0: \sigma_1^2 = \sigma_2^2$ versus $H_1: \sigma_1^2 \neq \sigma_2^2$). The results of the preliminary test appear at the bottom of the SAS Ttest output in the section titled "Equality of Variances." SAS provides an F statistic (computed by taking the ratio of the sample variances). It computes F by dividing the larger sample variance by the smaller, regardless of the group (1 or 2) designation. Therefore, the F statistic produced by SAS is always greater than or equal to 1. In this case, since the sample variance among freshmen is larger: $F = (2.12)^2/(1.39)^2 = 2.31$. The degrees of freedom associated with F are given by $df_1 = n_1 - 1 = 50 - 1 = 49$ and $df_2 = n_2 - 1 = 60 - 1 = 59$. For the preliminary test, SAS produces the probability of observing a value of F more extreme than the value of the test statistic (denoted Pr > F); the following rule should be applied to draw a conclusion:

p value


the test statistic.

x^

=

sj

Z < -1.96

if

Z
h

=

62.72

6\5

Thus, $n_1 = n_2 = 63$ subjects (126 total) are needed.

SAS Example 6.14 Sample Size Determination in Tests for Differences in Means Using SAS

The following output was generated using SAS to determine the sample sizes (per group) required to ensure a specified power (we considered scenarios with 80% and 90% power) for differences in means of 0.4 and 0.3 units with a standard deviation of 0.3 units. A brief interpretation appears after the output.

SAS Output for Example 6.14

OBS   ALPHA   POWER   BETA   Z_ALPHA2   Z_BETA    MU1   MU2   SIGMA   ES        N_2
1     0.05    0.9     0.1    1.95996    1.28155   0.8   0.4   0.3     1.33333    12
2     0.05    0.9     0.1    1.95996    1.28155   0.8   0.5   0.3     1.00000    22
3     0.05    0.8     0.2    1.95996    0.84162   0.8   0.4   0.3     1.33333     9
4     0.05    0.8     0.2    1.95996    0.84162   0.8   0.5   0.3     1.00000    16

Interpretation of SAS Output for Example 6.14

There is no SAS procedure specifically designed to determine the number of subjects required to detect a specific difference in the means of two independent populations. But as we did in Chapter 5 to determine the sample size required to detect a specific effect size in the one-sample test of hypothesis, we use SAS to program appropriate formulas. Once the formulas are implemented, users can evaluate different scenarios easily. In the output shown, four scenarios are considered (denoted OBS 1-4). Scenario 1 corresponds to the situation presented in Example 6.14. Five variables are input into the program, the level of significance (alpha), the power (power), the mean for group 1 (mu1), the mean for group 2 (mu2), and the standard deviation (sigma). Several variables are created in the program and the values of all variables are printed in the output. A description of the variables and an interpretation of results follows.

In scenario 1 (OBS = 1), the level of significance (alpha) is set at 0.05 (5%); the probability of Type II error, β, is computed to be 0.10 (10%); the mean in group 1 (mu1) was specified as 0.8; the mean in group 2 was specified as 0.4 (mu2); the standard deviation (sigma) was specified at 0.3; and the power (power) was specified at 0.90 (90%). The effect size (es) was computed by dividing the absolute value of the difference in means between groups by the standard deviation. In scenario 1, the effect size is 1.33. Twelve subjects (n_2) are required per group to ensure that the probability of detecting a 0.4 unit difference in means is 90% (i.e., power = 0.90), with a two-sided level of significance of 5%.

In scenario 2 (OBS = 2), we change the mean in group 2 to 0.5. The effect size is reduced to 1.00, and a total of 22 subjects are required per group to ensure that the probability of detecting a 0.3 unit difference in means is 90% (i.e., power = 0.90), with a two-sided level of significance of 5%. In scenarios 3 and 4 (OBS = 3 and 4), we consider the same scenarios and reduce the power to 0.80 (80%). The result is that fewer subjects are required. Nine and sixteen subjects are required per group, respectively, to ensure that the probability of detecting a 0.4 and 0.3 unit difference in means is 80%, with a two-sided level of significance of 5%.
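The four scenarios can be re-derived outside SAS. The following is a minimal sketch, assuming the per-group sample size comes from the usual formula n = 2((Z_{1-α/2} + Z_{1-β})/ES)², rounded up; the function name `n_per_group` is hypothetical and is not part of the SAS program described in the text.

```python
import math
from statistics import NormalDist


def n_per_group(alpha, power, mu1, mu2, sigma):
    """Per-group sample size for a two-sided test of equal means.

    Assumes equal group sizes and a common standard deviation (sigma).
    """
    z_alpha2 = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.95996 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # e.g. 1.28155 for power = 0.90
    es = abs(mu1 - mu2) / sigma                     # effect size
    # Round up so that the requested power is at least met.
    return math.ceil(2 * ((z_alpha2 + z_beta) / es) ** 2)


# (alpha, power, mu1, mu2, sigma) for OBS 1-4 in the output above
scenarios = [
    (0.05, 0.90, 0.8, 0.4, 0.3),
    (0.05, 0.90, 0.8, 0.5, 0.3),
    (0.05, 0.80, 0.8, 0.4, 0.3),
    (0.05, 0.80, 0.8, 0.5, 0.3),
]
sizes = [n_per_group(*s) for s in scenarios]
print(sizes)
```

Under these assumptions the sketch gives 12, 22, 9, and 16 subjects per group, matching the interpretation above.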

6.3 Key Formulas

Application        Notation/Formula

Confidence interval estimate for

(/

))

>



po(l



ri .

j

p(\



p)

\-

is

distri-

given here.

Estimating Proportion of Patients with Osteoarthritis

Consider the data from Example 7.1 and compute a 95% confidence interval for the proportion of all patients in the physician's practice with diagnosed osteoarthritis.

The appropriate formula is given in Table 7.1:

$$\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Substituting the sample data and the appropriate value from Table B.2A for 95% confidence:

$$0.19 \pm 1.96 \sqrt{\frac{0.19(1-0.19)}{200}}$$

$$0.19 \pm 1.96(0.028)$$

$$0.19 \pm 0.0549$$

$$(0.135,\ 0.245)$$

Thus, we are 95% confident that the true proportion of patients in this physician's practice with diagnosed osteoarthritis is between 13.5% and 24.5%.
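The same interval can be checked with a few lines of code. This is a minimal sketch, assuming 38 of the 200 patients (19%) have diagnosed osteoarthritis and that the large-sample condition is min(np, n(1 - p)) of at least 5; the function name `proportion_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def proportion_ci(x, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = x / n
    # Assumed large-sample check before using the normal approximation.
    if min(n * p_hat, n * (1 - p_hat)) < 5:
        raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


lower, upper = proportion_ci(38, 200)
print(f"({lower:.4f}, {upper:.4f})")
```

Without intermediate rounding the limits come out near 0.136 and 0.244, which differs slightly from the hand calculation because the text rounds the standard error to 0.028 before multiplying.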

SAS Example 7.2 Estimating Proportion of Patients with Osteoarthritis Using SAS

The following output was generated using SAS Proc Freq, which generates a frequency distribution table for a categorical (or ordinal) variable. In this example, we record whether each subject has been diagnosed with osteoarthritis (or not). The usual convention is to assign scores of 1 to successes (i.e., diagnosis of osteoarthritis) and scores of 0 to failures (i.e., free of osteoarthritis). The input data consists of designations (0 or 1) for each subject. A brief interpretation appears after the output.

SAS Output for Example 7.2

The FREQ Procedure

x    Frequency    Percent    Cumulative Frequency    Cumulative Percent
0       162        81.00             162                    81.00
1        38        19.00             200                   100.00

Proportion ASE 95% Lower Conf Limit 95% Upper Conf Limit

ASE under HO

One-sided Pr Two-sided Pr

0.0354
5 • given in Table 7.1:

$$Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$

3. Decision rule (see Table B.2B in the Appendix for the appropriate critical value).

Reject $H_0$ if $Z < -1.960$ or if $Z > 1.960$
Do not reject $H_0$ if $-1.960 \le Z \le 1.960$
Z
—1.645.

We do not have significant evidence to show a reduction in the proportion of patients seen in the clinic for flu after receiving the vaccine.

Example 7.4 brings up an important issue of clinical versus statistical significance. In the formal test of hypothesis, we failed to reach statistical significance. However, we may have committed a Type II error (e.g., a larger sample size may be required to detect an effect). In any statistical application it is extremely important to look at the direction and magnitude of the observed effect. In Example 7.4, there is a reduction in the proportion of flu cases seen following the vaccines. The point estimate is 0.12, or 12%. Our test did not indicate that this was statistically significantly lower than 15%; however, there is a reduction and it should be evaluated carefully. Is this reduction clinically important? On a different note, was the study design we used optimal to address the question of effectiveness of the flu shots in pediatric patients? A concurrent comparison group might have provided a better comparison than historical data (i.e., $p_0 = 0.15$). We will discuss tests with a concurrent comparison group in the following sections.

7.2 Cross-Tabulation Tables

In applications involving discrete variables, cross-tabulation tables are often constructed to display the data. Cross-tabulation tables are also called R x C ("R by C") tables, where R denotes the number of rows in the table and C denotes the number of columns. A 2 x 2 table is illustrated in Example 7.5.

Example 7.5 Cross-Tabulation to Summarize Proportions in Two Populations

A longitudinal study is conducted to evaluate the long-term complications in diabetic patients treated under two competing treatment regimens. Complications are measured by incidence of foot disease, eye disease, or cardiovascular disease within a 10-year observation period. The following 2 x 2 cross-tabulation table summarizes the data:

                 Long-Term Complications
Treatment        Yes      No      Total
Treatment 1       12      88       100
Treatment 2        8      92       100
Total             20     180       200

The estimate of the population proportion of all patients who develop complications under treatment 1 ($p_1$) is $\hat{p}_1 = 12/100 = 0.12$, by (7.2). This is equivalent to the estimate of the probability that a single patient develops complications under treatment 1. The estimate of the probability that a single patient develops complications under treatment 2 is $\hat{p}_2 = 8/100 = 0.08$.

The probability of success or outcome (in Example 7.5, the outcome of interest is the development of complications) is often called the risk of outcome. There are a number of statistics used to compare risks of outcomes between populations (or between treatments). These statistics are called effect measures and are described in detail in Chapter 8.

SAS Example 7.5 Generating Cross-Tabulations Using SAS

The following output was generated using SAS Proc Freq, which generates a contingency table (or cross-tabulation) when two variables are specified. A brief interpretation appears after the output.

SAS Output for Example 7.5

The FREQ Procedure

Table of trt by compl

trt        compl

Frequency |
Percent   |
Row Pct   |
Col Pct   | 1_yes   | 2_no    |  Total
----------+---------+---------+
trt_1     |     12  |     88  |    100
          |   6.00  |  44.00  |  50.00
          |  12.00  |  88.00  |
          |  60.00  |  48.89  |
----------+---------+---------+
trt_2     |      8  |     92  |    100
          |   4.00  |  46.00  |  50.00
          |   8.00  |  92.00  |
          |  40.00  |  51.11  |
----------+---------+---------+
Total           20       180       200
             10.00     90.00    100.00

Sample Size = 200

Interpretation of SAS Output for Example 7.5

SAS generates a contingency table and each cell of the table displays the Frequency, the Percent, the Row Percent, and the Column Percent (see legend in top left corner of table). The Frequency is the number of subjects in each cell, and the Percent is the percent of all subjects in each cell. For example, there are 12 patients in the top left cell (i.e., on treatment 1 who also had complications). These patients reflect 6% of the total sample (12/200 = 0.06).

The Row Percent is the percent of subjects in the particular row that fall in that cell. For example, there are 100 patients on treatment 1, or 100 patients in the first row of the contingency table. The 12 patients who had complications reflect 12% of all patients on treatment 1 (12/100 = 0.12). The Column Percent is the percent of subjects in the particular column who fall in that cell. For example, there are 20 patients who report complications. These 20 patients appear in the first column of the contingency table. The 12 patients in treatment 1 who had complications reflect 60% of all patients who had complications (12/20 = 0.60). The row total and column total (called the marginal totals) are displayed to the right and at the bottom of the contingency table, respectively. Both row and column frequencies and percents (of total) are displayed.
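The four quantities shown in each cell can be reproduced directly from the counts. The sketch below is illustrative only: the dictionary layout and the function name `cell_percents` are assumptions made for this example and do not correspond to SAS syntax.

```python
def cell_percents(counts):
    """Return frequency, percent of total, row percent, and column percent per cell."""
    total = sum(sum(row.values()) for row in counts.values())
    col_totals = {}
    for row in counts.values():
        for col, freq in row.items():
            col_totals[col] = col_totals.get(col, 0) + freq

    result = {}
    for row_label, row in counts.items():
        row_total = sum(row.values())
        for col, freq in row.items():
            result[(row_label, col)] = {
                "frequency": freq,
                "percent": 100 * freq / total,             # share of all subjects
                "row_pct": 100 * freq / row_total,         # share within the treatment
                "col_pct": 100 * freq / col_totals[col],   # share within the outcome
            }
    return result


# Counts from Example 7.5
counts = {
    "trt_1": {"yes": 12, "no": 88},
    "trt_2": {"yes": 8, "no": 92},
}
cells = cell_percents(counts)
top_left = cells[("trt_1", "yes")]
print(top_left)
```

For the top left cell this gives 6% of the total sample, a Row Percent of 12%, and a Column Percent of 60%, as described above.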

7.3 Diagnostic Tests: Sensitivity and Specificity

A diagnostic test is a tool used to detect outcomes or events that are not directly observable. For example, an individual may have a condition or disease that is not directly observable by a physician. A diagnostic test designed to detect such a condition can be used as a tool to assist the physician in detection. Desirable properties in diagnostic tests include the following:

• The diagnostic test will indicate an event when the event is present, and
• The diagnostic test will indicate a nonevent when the event is absent.

Example 7.6 Estimating Sensitivity and Specificity

A clinical trial is conducted to evaluate a diagnostic screening test designed to detect fetal chromosomal abnormalities. The diagnostic test is performed on a random sample of 200 pregnant women, who later undergo an amniocentesis. Chromosomal fetal abnormalities are confirmed using amniocentesis. The following 2 x 2 cross-tabulation table summarizes the data:

                         Diagnostic Test
Amniocentesis            Positive    Negative    Total
Abnormal (Disease)           14           6         20
Normal (No Disease)          64         116        180
Total                        78         122        200

Based on amniocentesis, the estimate of the population proportion of all women carrying fetuses with chromosomal abnormalities ($p$) is $\hat{p} = 20/200 = 0.10$, by (7.2).

The following statistics are used to describe diagnostic tests: the sensitivity of the test, the specificity of the test, the predictive value positive ($PV^+$), and the predictive value negative ($PV^-$). These statistics are defined as follows:

Sensitivity = P(Positive test | Disease)
Specificity = P(Negative test | No disease)                    (7.5)
Predictive value positive = P(Disease | Positive test)
Predictive value negative = P(No disease | Negative test)

In Example 7.6, the estimate of the sensitivity of the test is 14/20 = 0.70. The estimate of the specificity is 116/180 = 0.64. The estimate of the predictive value positive is $PV^+$ = 14/78 = 0.18, and the estimate of the predictive value negative is $PV^-$ = 116/122 = 0.95. In most cases, higher sensitivities and higher specificities are desirable. There are instances, however, where a better test is determined by only one criterion (e.g., higher sensitivity).

SAS Example 7.6 Estimating Sensitivity and Specificity Using SAS

The following output was generated using SAS Proc Freq, which generates a contingency table (or cross-tabulation) when two variables are specified. SAS does not produce the estimates of sensitivity, specificity, false positive rate, and false negative rate directly, but these can be extracted from the contingency table. The statistics of interest are described after the output.

SAS Output for Example 7.6

The FREQ Procedure

Table of amnio by diagtest

amnio      diagtest

Frequency |
Percent   |
Row Pct   |
Col Pct   | positive | negative |  Total
----------+----------+----------+
abnormal  |      14  |       6  |     20
          |    7.00  |    3.00  |  10.00
          |   70.00  |   30.00  |
          |   17.95  |    4.92  |
----------+----------+----------+
normal    |      64  |     116  |    180
          |   32.00  |   58.00  |  90.00
          |   35.56  |   64.44  |
          |   82.05  |   95.08  |
----------+----------+----------+
Total            78        122       200
              39.00      61.00    100.00

Interpretation of SAS Output for Example 7.6

The sensitivity is the proportion of abnormal cases correctly classified by the test as positive, 14/20 = 0.70. This is the Row Percent in the top left cell of the table. The specificity is the proportion of normal cases that are correctly classified by the test as negative, 116/180 = 0.644. This is the Row Percent of the bottom right cell of the table. The predictive value positive is the proportion of cases classified as positive that are, in fact, diseased, $PV^+$ = 14/78 = 0.18. This is the Column Percent of the top left cell of the table. The predictive value negative is the proportion of cases classified as negative that are, in fact, normal, $PV^-$ = 116/122 = 0.95. This is the Column Percent of the bottom right cell of the table.
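Because SAS does not report these statistics directly, they can be computed from the four cell counts. The following is a minimal sketch using the counts from Example 7.6; the function name `diagnostic_summary` and its argument names are hypothetical.

```python
def diagnostic_summary(tp, fn, fp, tn):
    """Estimate diagnostic test statistics from a 2 x 2 table.

    tp: diseased and test positive    fn: diseased and test negative
    fp: not diseased, test positive   tn: not diseased, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),   # P(Positive test | Disease)
        "specificity": tn / (tn + fp),   # P(Negative test | No disease)
        "pv_positive": tp / (tp + fp),   # P(Disease | Positive test)
        "pv_negative": tn / (tn + fn),   # P(No disease | Negative test)
    }


# Example 7.6: abnormal row is (14, 6), normal row is (64, 116)
stats = diagnostic_summary(tp=14, fn=6, fp=64, tn=116)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The sensitivity and specificity condition on the rows of the table (true disease status), while the predictive values condition on the columns (test result), which is why they correspond to the Row Percent and Column Percent entries, respectively.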

7.4 Statistical Inference Concerning ($p_1 - p_2$)

We often compare two independent populations with respect to the proportion of successes in each. A better study design to evaluate the effectiveness of flu shots in pediatric patients (Example 7.4) would involve two comparison groups. One group would receive the flu shots and the other would receive a placebo shot (to maintain blinding; why is this important?). The analysis would then compare the groups with respect to the proportions of children who developed flu. In the two independent samples situation, one parameter of interest is the difference in proportions, the risk difference: ($p_1 - p_2$), where $p_1$ = the proportion of successes in population 1 and $p_2$ = the proportion of successes in population 2.

The point estimate for the risk difference, or difference in independent proportions, is given by

$$\hat{p}_1 - \hat{p}_2$$

where $\hat{p}_i$ = the sample proportion in population $i$ ($i = 1, 2$).

If samples from both populations are sufficiently large (see criteria in Table 7.2), then the confidence interval formula shown in Table 7.2 can be used to estimate ($p_1 - p_2$).

Table B.2A contains the values from the standard normal distribution for commonly used confidence levels. When the sample sizes are adequate (i.e., if and only if $\min(n_1\hat{p}_1,\, n_1(1-\hat{p}_1)) > 5$ and $\min(n_2\hat{p}_2,\, n_2(1-\hat{p}_2)) > 5$), the confidence interval formula given in Table 7.2 is appropriate. If either (or both) sample size(s) are not adequate, alternative formulas are available that are based on the binomial distribution and not the normal approximation given here.

Table 7.2 Confidence Interval for ($p_1 - p_2$)

Attributes:
  Simple random samples from binomial populations
  Independent populations
  Large samples: $\min(n_1\hat{p}_1,\, n_1(1-\hat{p}_1)) > 5$ and $\min(n_2\hat{p}_2,\, n_2(1-\hat{p}_2)) > 5$

Confidence Interval:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

where $\hat{p}_1 = X_1/n_1$ and $\hat{p}_2 = X_2/n_2$

Example 7.7 Estimating Difference in Proportions of Children Who Use the Emergency Room Between Treatments

We want to evaluate the effectiveness of a new treatment for asthma. The new treatment is administered in an inhaler and will be compared to a standard treatment administered in the same way. Because asthma is a serious condition, it would be unethical to use a placebo comparator in this trial. Suppose our outcome variable is emergency room (ER) use for complications of asthma during a 6-month follow-up period. A random sample of 375 asthmatic children is selected from a registry, of which 250 are randomized to the new treatment group and 125 are randomized to the comparison group (standard treatment). Both groups are provided instruction on the proper use of their inhalers. This allocation scheme is called 2-to-1, where twice as many participants are randomized to the investigational treatment as the control. Both groups of children are followed for 6 months and monitored for ER use. Of the children on the new treatment, 60 used the ER during the 6 months for complications of asthma, and 19 of the children on the standard treatment used the ER for complications of asthma during the same period. Construct a 95% confidence interval for the difference in the proportions of asthmatic children on the new and standard treatments who used the ER during the 6-month follow-up period. The data layout is as follows:

Asthma Registry
    New treatment:         n = 250     X = 60 use ER
    Standard treatment:    n = 125     X = 19 use ER


5 • and

$$\min(n_2\hat{p}_2,\, n_2(1-\hat{p}_2)) = \min(125(0.15),\, 125(1-0.15)) = \min(19,\, 106) = 19 > 5$$



The formula from Table 7.2 is appropriate:

$$(\hat{p}_1 - \hat{p}_2) \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Substituting the sample data and the appropriate value from Table B.2A for 95% confidence:

$$0.09 \pm 1.960 \sqrt{\frac{0.24(1-0.24)}{250} + \frac{0.15(1-0.15)}{125}}$$

$$0.09 \pm 1.960(0.042)$$

$$0.09 \pm 0.082$$

$$(0.008,\ 0.172)$$
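The calculation above can be repeated without intermediate rounding. This is a minimal sketch of the Table 7.2 interval using the Example 7.7 counts; the function name `risk_difference_ci` is hypothetical, and the large-sample check is written with the "> 5" criterion as it appears in this text.

```python
import math
from statistics import NormalDist


def risk_difference_ci(x1, n1, x2, n2, confidence=0.95):
    """Large-sample confidence interval for p1 - p2 (independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    # Large-sample criteria applied to each group separately.
    for n, p in ((n1, p1), (n2, p2)):
        if min(n * p, n * (1 - p)) <= 5:
            raise ValueError("sample too small for the normal approximation")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - z * se, diff + z * se


# New treatment: 60 of 250 used the ER; standard treatment: 19 of 125
lower, upper = risk_difference_ci(60, 250, 19, 125)
print(f"({lower:.4f}, {upper:.4f})")
```

With unrounded proportions (19/125 = 0.152 rather than 0.15) the limits come out near 0.006 and 0.170, slightly different from the hand calculation but leading to the same reading of the interval: it does not include 0.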

Thus, we are 95% confident that the true difference in the population proportions of asthmatic children on the new treatment as compared to children on standard treatment who used the ER during a 6-month period is between 0.8% and 17.2%. Based on the confidence interval estimate, can we say that there is a significant difference in the proportions of asthmatic children on the new treatment as compared to the standard treatment who used the ER during a 6-month period? (Hint: Does the confidence interval estimate include 0?) Is the new treatment effective? Notice the direction of the effect.

NOTE: The two-sample confidence interval concerning ($p_1 - p_2$) estimates the difference in proportions, as opposed to the value of either proportion (as was the case in the one-sample applications).

In some applications, it is of interest to compare two populations on the basis of the proportions of successes in each using a formal test of hypothesis.

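Table 7.3 contains the test statistic for tests concerning ($p_1 - p_2$).

Table 7.3 Test Statistic for ($p_1 - p_2$)

Attributes:
  Simple random samples from binomial populations
  Independent populations
  Large samples*

Test Statistic:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $\hat{p}_1 = X_1/n_1$, $\hat{p}_2 = X_2/n_2$, and $\hat{p} = (X_1 + X_2)/(n_1 + n_2)$

*$\min(n_1\hat{p}_1,\, n_1(1-\hat{p}_1)) > 5$ and $\min(n_2\hat{p}_2,\, n_2(1-\hat{p}_2)) > 5$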

Example 7.8 Testing Difference in Proportions of Patients Who Experience Pain Relief Between Treatments

A new drug is being compared to an existing drug for its effectiveness in relieving headache pain. One hundred subjects who suffer from chronic headaches are randomly assigned to either Group 1: Existing Drug, or Group 2: New Drug. Subjects do not know which drug they are taking in this experiment. Subjects are provided with a single dose of the assigned drug and instructed to take the full dose as soon as they experience headache pain and to record whether or not they experience relief from headache pain within 60 minutes. Among the 50 subjects assigned to Group 1: Existing Drug, 28 reported relief from headache pain within 60 minutes. Among the 50 subjects assigned to Group 2: New Drug, 34 reported relief from headache pain within 60 minutes. Based on the data, is the proportion of subjects reporting relief from headache pain within 60 minutes under the New Drug significantly different from the proportion of subjects reporting relief within 60 minutes under the Existing Drug? Use a 5% level of significance. The data layout is as follows:

Group 1: Existing Drug    n = 50    X = 28 (relief)
Group 2: New Drug         n = 50    X = 34 (relief)

1. Set up hypotheses.

$H_0: p_1 = p_2$
$H_1: p_1 \neq p_2$, $\alpha = 0.05$

where $p_1$ = the proportion of patients who experience relief from headache pain using the Existing Drug
      $p_2$ = the proportion of patients who experience relief from headache pain using the New Drug

2. Select the appropriate test statistic.

The sample proportions are

$$\hat{p}_1 = \frac{28}{50} = 0.56, \qquad \hat{p}_2 = \frac{34}{50} = 0.68$$

First, check whether or not the sample sizes are sufficiently large:

$$\min(n_1\hat{p}_1,\, n_1(1-\hat{p}_1)) = \min(50(0.56),\, 50(1-0.56)) = \min(28,\, 22) = 22 > 5$$

and

$$\min(n_2\hat{p}_2,\, n_2(1-\hat{p}_2)) = \min(50(0.68),\, 50(1-0.68)) = \min(34,\, 16) = 16 > 5$$

The appropriate test statistic is given in Table 7.3:

$$Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

3. Decision rule.

Reject $H_0$ if $Z < -1.960$ or if $Z > 1.960$
Do not reject $H_0$ if $-1.960 \le Z \le 1.960$
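The test set up in Example 7.8 can be carried through numerically. The sketch below assumes the pooled-proportion form of the Table 7.3 statistic, with the overall proportion taken as (X1 + X2)/(n1 + n2); the function name `two_proportion_z` is hypothetical, and the numerical result comes from this sketch rather than from the text.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic and two-sided p value for H0: p1 = p2 (pooled estimate)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Group 1 (Existing Drug): 28 of 50 report relief; Group 2 (New Drug): 34 of 50
z, p_value = two_proportion_z(28, 50, 34, 50)
reject_h0 = abs(z) > 1.960  # two-sided decision rule at alpha = 0.05
print(f"Z = {z:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```

Under these assumptions Z is about -1.24, which falls between -1.960 and 1.960, so the decision rule above would not reject the null hypothesis at the 5% level.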
Reject $H_0$ if $\chi^2 \ge 5.99$
Do not reject $H_0$ if $\chi^2 < 5.99$

4. Test statistic.

To organize the computations of the test statistic (7.7), the following table is used:

Time Slot:                 Mondays                Thursdays               Saturdays                Total
                           6:00-7:30 PM           4:00-5:30 PM            8:00-9:30 AM
O = observed frequency:    47                     32                      21                       100
E = expected frequency:    33.3                   33.3                    33.3                     100
(O - E):                   13.7                   -1.3                    -12.3
(O - E)^2/E:               (13.7)^2/33.3 = 5.64   (-1.3)^2/33.3 = 0.05    (-12.3)^2/33.3 = 4.54    10.23

NOTE: The sum of the expected frequencies is equal to the sum of the observed frequencies (n = 100).

The test statistic is $\chi^2 = 10.23$.

5. Conclusion.

Reject $H_0$ since 10.23 > 5.99. We have significant evidence, $\alpha = 0.05$, to show that the three time slots are not equally popular or convenient for the patients. In fact, almost half (47%) of the patients selected the Monday 6:00-7:30 PM slot. For this example, p < 0.01 (see Table B.5).

SAS Example 7.9 Goodness of Fit Test for Patient Preferences Using SAS

The following output was generated using SAS Proc Freq with an option to run a goodness-of-fit test. Specifically, the user specifies the distribution of responses under the null hypothesis (see Section 7.9 for the SAS code).

SAS Output for Example 7.9

The FREQ Procedure

day      Frequency    Percent    Test Percent    Cumulative Frequency    Cumulative Percent
Mon          47        47.00        33.00                 47                   47.00
Sat          21        21.00        33.00                 68                   68.00
Thurs        32        32.00        33.00                100                  100.00

Chi-Square Test for Specified Proportions

Chi-Square      10.3333
DF                    2
Pr > ChiSq       0.0057

Sample Size = 100

Interpretation of SAS Output for Example 7.9

SAS first generates a frequency distribution table and provides the number and percent of respondents in each response category. SAS then lists the Test Percent in each category (these are supplied by the user and reflect the expected proportions). The last two columns contain the cumulative frequencies and cumulative percents for the sample data. SAS then produces the chi-square statistic for the goodness-of-fit test along with degrees of freedom (df = k - 1) and a p value. Here $\chi^2 = 10.33$ and p = 0.0057. We therefore reject $H_0$ because p = 0.0057 < 0.05 and conclude that the three time slots are not equally popular or convenient for the patients.
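The small gap between the hand-computed statistic (10.23) and the SAS value (10.3333) can be explored with a short sketch. It assumes, as an inference from the Test Percent column rather than a documented fact about the SAS run, that the SAS expected counts were 33 per category; the function name `chi_square_gof` is hypothetical. The p value uses the closed form exp(-x/2), which holds for the chi-square distribution with 2 degrees of freedom only.

```python
import math


def chi_square_gof(observed, expected):
    """Goodness-of-fit statistic: sum of (O - E)^2 / E over the categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [47, 32, 21]  # Mondays, Thursdays, Saturdays

# Expected counts under exactly equal popularity (100/3 per slot)
chi_equal = chi_square_gof(observed, [100 / 3] * 3)

# Expected counts implied by a Test Percent of 33.00 per category (assumption)
chi_sas = chi_square_gof(observed, [33.0] * 3)

# Upper-tail probability for df = 2: P(chi-square > x) = exp(-x / 2)
p_value = math.exp(-chi_sas / 2)

print(f"equal split: {chi_equal:.2f}, test percent 33: {chi_sas:.4f}, p = {p_value:.4f}")
```

Either choice of expected counts leads to the same conclusion here, since both statistics exceed the critical value of 5.99.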


Example 7.10 Goodness of Fit Test for Teen Issues

Volunteers at a teen hotline have been assigned based on the assumption that 40% of all calls are drug related, 25% are sex related (e.g., date rape), 25% are stress related, and 10% concern educational issues. For this investigation, each call is classified into one category based on the primary issue raised by the caller. To test the hypothesis, the following data are collected from 120 randomly selected calls placed to the teen hotline. Based on the data, is the assumption regarding the distribution of topic issues appropriate?

Topic Issue:         Drugs    Sex    Stress    Education
Number of calls:       52      38      21          9

1. Set up hypotheses.

$H_0: p_1 = 0.40,\ p_2 = 0.25,\ p_3 = 0.25,\ p_4 = 0.10$
$H_1$: $H_0$ is false

or

$H_0$: Distribution across categories is 0.40, 0.25, 0.25, 0.10
$H_1$: $H_0$ is false, $\alpha = 0.05$

2. Select the appropriate test statistic.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where $\sum$ indicates summation over the k response categories
      O = observed frequencies
      E = expected frequencies (i.e., if $H_0$ is true, or under $H_0$)

3. Decision rule.

In order to select the appropriate critical value, we first determine the degrees of freedom.

df = k - 1 = 4 - 1 = 3

The appropriate critical value of $\chi^2$ is $\chi^2 = 7.815$ from Table B.5. The decision rule is

Reject $H_0$ if $\chi^2 \ge 7.815$
Do not reject $H_0$ if $\chi^2 < 7.815$

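4. Test statistic.

To organize the computations of the test statistic, the following table is used:

Topic Issue:               Drugs             Sex               Stress             Education          TOTAL
O = observed frequency:    52                38                21                 9                  120
E = expected frequency:    48                30                30                 12                 120
(O - E):                   4                 8                 -9                 -3
(O - E)^2/E:               (4)^2/48 = 0.33   (8)^2/30 = 2.13   (-9)^2/30 = 2.70   (-3)^2/12 = 0.75

NOTE: The sum of the expected frequencies is equal to the sum of the observed frequencies (n = 120).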
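The expected frequencies and the statistic for Example 7.10 follow from the hypothesized proportions. The following is a minimal sketch; the variable names are illustrative, the critical value 7.815 is the one quoted from Table B.5 for df = 3, and the total and decision are computed by the sketch rather than quoted from the text.

```python
observed = [52, 38, 21, 9]               # Drugs, Sex, Stress, Education
proportions = [0.40, 0.25, 0.25, 0.10]   # distribution assumed under H0

n = sum(observed)
# Expected frequency in each category if H0 is true: n times the hypothesized proportion
expected = [n * p for p in proportions]

# Contribution of each category to the chi-square statistic
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

critical_value = 7.815                   # chi-square critical value, df = 3, alpha = 0.05
reject_h0 = chi_square >= critical_value

print([round(c, 2) for c in contributions])
print(f"chi-square = {chi_square:.2f}, reject H0: {reject_h0}")
```

With these inputs the statistic is about 5.92, which is below 7.815, so the decision rule in step 3 would not reject the assumed distribution of call topics.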

=p

p.

Confidence interval estimate for I

P\

-

Test //

Pi

:

p.

(pi

-

/>:>

-

-

1

Pi'

.

P:
':

»i

'

Z

po)

-Pi'

1

Z,.

for

1

.

See Table 7.1 for necessary conditions

P-Po

Z=

/

necessary conditions

n

Pi

Z

in

Table B.2A

See Table

7.

3

(find

tor necessar)

conditions and definitions of components of Z Test //

:

distribution

=

X

of responses follows

?

2-

'

df

= *~

C

In-square



goodness-of-fit test

specified pattern

Test //

:

two variables

are independent

Find n to estimate

,

£