Bancroft’s Introduction to Biostatistics [2 ed.]


Digitized by the Internet Archive in 2015

https://archive.org/details/bancroftsintroduOObanc

Bancroft’s Introduction to Biostatistics

SECOND EDITION

JOHANNES IPSEN, M.D., M.P.H.
Professor of Medical Statistics and Epidemiology, Department of Community Medicine, School of Medicine, University of Pennsylvania, Philadelphia

POLLY FEIGL, PH.D.
Assistant Professor of Biostatistics, Department of Biostatistics, School of Public Health and Community Medicine, University of Washington, Seattle

Medical Department
HARPER & ROW, PUBLISHERS
New York, Evanston, and London

BANCROFT’S INTRODUCTION TO BIOSTATISTICS

Copyright © 1957, 1970 by Harper & Row, Publishers, Inc. All rights reserved. No part of this book may be used or reproduced in any manner whatsoever without written permission except in the case of brief quotations embodied in critical articles and reviews. Printed in the United States of America. For information address Medical Department, Harper & Row, Publishers, Inc., 49 East 33rd Street, New York, N.Y. 10016.

FIRST EDITION

LIBRARY OF CONGRESS CATALOG CARD NUMBER: 74-106338

Contents

Preface to the Second Edition
Preface to the First Edition
1 Introduction
2 Distributions
3 The Normal Distribution
4 The Mean and Other Measures of Central Tendency
5 The Standard Deviation and Other Measures of Variation
6 Significance of Differences in Means
7 The Binomial and Poisson Distributions
8 Significance of Differences in Proportions
9 Correlation and Regression
10 Goodness of Fit
11 Medical and Vital Statistics
12 Comparison of Rate Tables
13 Evaluation of Effectivity and Risk
14 Techniques in Follow-Up Studies
15 Bioassay
16 Design of Experiment
Appendix on Computational Methods
List of Tables
Answers to Problems
Index

Preface to the Second Edition

In revising this textbook 12 years after its first appearance, we have chosen to follow the outline and approach of the First Edition as closely as possible. We have tried to retain the spirit of the original text, with its emphasis on application of biostatistics to medical problems. The attention to small sample procedures and computational instruction has been increased. The t test is presented early in the text; the F test is introduced; and accurate computational formulae are stressed.

Growing familiarity with statistics, as evidenced by the medical literature in the last decade, has motivated the inclusion of discussions of relative risk, measures of effectiveness, and design of experiments.

We especially admire the late Dr. Bancroft’s choice of pertinent clinical examples and although some have been updated, the bulk of her problems have been kept. A list of computational answers has been added. Like Dr. Bancroft, we hope the book will be useful for self-study as well as in the classroom.

We are indebted to the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank Yates, F.R.S., and to Oliver & Boyd Ltd., Edinburgh, for permission to reprint Table C from their book Statistical Tables for Biological, Agricultural and Medical Research. Also we wish to thank W. G. Cochran, G. M. Cox, and John Wiley & Sons, Inc., New York, for permission to reprint Table E from their book Experimental Designs.

Philadelphia, Pennsylvania

Johannes Ipsen
Polly Feigl

Preface to the First Edition

The present textbook represents the third revision of a series of mimeographed notes prepared for use in the teaching of biostatistics to sophomore medical students at Tulane University. The apparent usefulness to, and acceptance by, the students here during the past 9 years has led the author to agree to its publication. As originally written it has been used in a course covering approximately 48 hours, of which 15 are spent in lectures, the balance in supervised laboratory work.

The text deals mainly with statistical methods appropriate for large samples. Frequency distributions, tabulation, graphing, use of centering constants and measures of variation, and other descriptive methods are treated in the introductory chapters. Further topics include the binomial, χ², and the use of the normal distribution in large sample approximate tests of significance for differences in sample means and proportions. Additional chapters deal with the t tests and vital statistics (including the use of the modified life table techniques in follow-up studies), and the final chapter presents a brief discussion of quantal bioassay and the Reed-Muench method.

Because the book was written for medical students and practicing physicians, who frequently have little training in mathematics, the presentation is in simple, nontechnical terms. No knowledge of mathematics beyond elementary high-school algebra is required for understanding. Since the book will be used primarily by the medical profession, illustrations in the text and examples at the end of each chapter have been taken almost entirely from clinical medicine.

The author wishes to acknowledge her gratitude to Dr. John Fertig, Dr. Alan Treloar, and Dr. R. A. Fisher and their publishers for permission to print the tables dealing with size of sample, the χ², and t distribution, respectively.

Dr. Lila Elveback has given many helpful suggestions in the preparation of the manuscript. To Miss Ethel Eaton credit should be given for her painstaking preparation of the illustrations as well as for many of the computations. To Mary Grace Kelleher, Patricia Spaid, and Arva Boesch thanks are due for the careful typing of the manuscript.

New Orleans, Louisiana

Huldah Bancroft

1 Introduction

A growing emphasis on the role of quantitative methods in medicine makes it imperative that the student of medical science have some knowledge of statistics. The medical student while in school is taught the best method of diagnosis and therapy. After graduation he must of necessity depend on current literature to learn new methods of therapy, diagnosis, and prevention. Thus he must be able to evaluate for himself the results of other workers. He must decide when a new technique or method shall supplement or replace an older one. He must be able to answer a mother’s question about new preventive measures with as much surety as he now advises her regarding vaccination against smallpox and polio. He should be able to give the family intelligent assurance of the prognosis for a given patient. Such prognosis may depend on his ability adequately to appraise laboratory findings as well as on his knowledge of the relation of age, sex, and other conditions of the patient to a particular disease. New knowledge regarding facts such as these will come to the physician through research work done by himself or by others. He must, therefore, be able to select from masses of information that which is of high caliber and which will pass rigid scientific tests. He must develop a healthy skepticism toward everything he reads.

Just how can a knowledge of statistical techniques assist him in this problem? First and foremost, he must recognize that individuals vary not only from each other but also within themselves from day to day. A certain amount of variation is normal, but the question facing the physician is that of determining when a specific variation becomes pathologic. To assess this, the student must learn how variation in normal individuals is measured and what the range of normal variation is. He must learn that there is some error present in every measurement or count made. It is highly unlikely that two successive red blood cell counts made on the same specimen of blood will be identical. When, therefore, does a difference become greater than error of measurement? For example, a patient had a red blood cell count of 4.3 million cells yesterday. The laboratory today reports 4.2 million cells. Does this indicate that the count is decreasing? Another patient is admitted with a white blood cell count of 6,000. Two hours later the laboratory reports a count of 6,200 on the same patient. Does this indicate a real rise in white count? In other words, can these differences be explained by the inaccuracy inherent in the method of counting blood cells? To treat his patient with the utmost skill the physician must know the answers to questions such as these. For every measurement or determination provided by the laboratory, the physician should know the variation that is a part of the method itself; otherwise he will not know when a given variation represents a real change in a patient.

Whenever new methods of diagnosis or therapy are introduced, the question to be answered is whether the specific new method under consideration is superior to the old method. Critical evaluation of the experimental study must be made. Questions that must be answered are:

1. Were controls used in measurement of results? If so, were they well selected? Were the experimental group and the control group as nearly alike as possible with regard to all known factors that would affect the outcome? Were all the factors that differentiated the group identified and evaluated?

2. Was the difference in the results obtained greater than could be attributed to chance or normal variation?

Only when these questions have been answered can conclusions regarding a new method be drawn. It must be recognized that there are no statistical techniques available that will prove that one treatment is in all respects better than another. They will, however, give the probability of a given difference occurring if there is no real difference in treatment so that the reader may conclude whether in a particular instance the difference is significant.

More generally, critical assessment of medical information often requires knowledge of basic statistical concepts. Some of the most fundamental of them are presented in the following chapters. An attempt is made to present concepts with a minimum of technical terminology and mathematical proof. The science of medical statistics cannot be covered in an elementary text, in part because of the increasing complexity of the medical sciences. The medical investigator, unlike his basic-science colleagues, is restricted by regard for his patient in his attempt to make scientifically sound clinical observations. He has difficulty studying one or two variables alone in a strictly designed investigation. Any realistic study involves many variables, the analysis of which may require a variety of statistical approaches and techniques.

This book may enable the reader to perform simple statistical tests, but the primary objective is to give him insight into statistical thought. Assuredly, most sophisticated data treatment will, in the modern research setting, be undertaken by a professional statistician, but the medical man should be on speaking terms with this worker so that he can both advise him in the problem formulation and interpret the results of the statistical analyses.

2 Distributions

It has been said that all knowledge arises through some process of observation. This observation may be as simple as noting whether an event occurs or does not occur, such as the fact that an individual wears glasses or does not wear glasses. On the other hand, the observation may involve a more complicated procedure, such as the measurement of the amount of hemoglobin in 100 cc of blood. The characteristic being observed or measured is called a variable. The variable takes on a value for each subject (or other experimental unit) under observation. Typically these observed values differ from subject to subject, thus providing the variability which is the raw material of statistics.

Statistics is the art and science of numerical distributions. In this text we are dealing mostly with sample distributions, that is, fractions of total populations or universes. If the samples can be assumed to represent the population, characteristics of the sample distribution can be generalized to hold for the total population. For example, the distribution of differential counts of white blood cells in a sample of 40 healthy adult men may be used to describe that distribution in the total population of healthy adult men.

Regardless of the type of variable, it must be observed, recorded, and transcribed for arrangement in a distribution. The “anatomy” of a distribution of a variable is mainly as follows:

1. The sample size (n) is the number of observations of the variable

2. The sample space (x) is the set consisting of all possible values of the observations

3. The classes (x_i) are mutually exclusive and exhaustive subsets of the sample space

4. The frequency (f_i) with which an observation occurs in class x_i is the number of observations in that class

It follows that the sum of frequencies f_i must equal n, or else a miscount has occurred or the scale has insufficient classes to account for all observations.
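The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_distribution` and the eight-subject sample are hypothetical and do not come from the text. The sketch tallies f_i for each class, rejects any observation that the scale has no class for, and confirms that the frequencies add up to n.

```python
from collections import Counter

def frequency_distribution(observations, classes):
    """Tally the frequency f_i of each class x_i (hypothetical helper)."""
    counts = Counter(observations)
    # An observation outside every class means the scale has too few classes.
    unknown = set(counts) - set(classes)
    if unknown:
        raise ValueError(f"scale has no class for: {sorted(unknown)}")
    freqs = {c: counts.get(c, 0) for c in classes}
    # The frequencies must add up to the sample size n.
    assert sum(freqs.values()) == len(observations)
    return freqs

# Hypothetical sample of n = 8 subjects on a nominal scale.
classes = ["male", "female", "unknown"]
sample = ["female", "male", "female", "unknown",
          "male", "female", "female", "male"]
freqs = frequency_distribution(sample, classes)
n = len(sample)
print(freqs, "n =", n)
```

Raising an error for an unclassifiable observation, rather than silently dropping it, mirrors the point that a shortfall in the sum signals either a miscount or an incomplete scale.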

The kind of scale to be used depends on the nature of the variable under observation. Enumerative (i.e., count) data are on nominal or ordinal scales. A nominal scale has merely descriptive classes, for example:

Variable          Class
Sex               Male, female, unknown
Cause of death    Heart disease, cancer, stroke, all others

The common yes-no scale is called dichotomous. Unfortunately many scales intended as dichotomous end up with an additional class: “not known” or some title of that nature.

An important nominal distribution is diagnoses of diseases and causes of death. International agreement over almost a century has produced a three-digit code which places all diseases in some category that does not necessarily follow a strict anatomical or etiological system, but which is most useful for comparative studies.

Ordinal or rank scales also have descriptive classes, but they are arranged in order of intensity. For example: Clinical outcome: dead, unchanged, better, recovered; or Tissue reaction: 0, +, ++, +++. Many such scales, although seemingly subjective, are very useful in the hands of expert clinicians and research scientists.

A measured variable is one whose observed values are recorded on a numerical scale. Such a measured (or quantitative) variable can be classified as being either discrete or continuous. The former implies a restriction of the possible outcome to isolated values, for example, Number of pregnancies: 0, 1, .... A variable is continuous if, conceptually, it could take on any value in certain intervals. Continuous scales are used for measurements when fractions are real and meaningful, such as 12.1 gm hemoglobin per 100 cc, or 5.3 mg uric acid. In practice, the distinction between the discrete and continuous numerical scales is not very important. Averages of discrete scales may be expressed in fractions, such as 1.45 pregnancies as a collective experience of a group of female patients. Further, a continuous scale may often, for ease of presentation, appear with subsets of whole numbers (e.g., height 62, 63, ... in.).

The desirable scale for a measured variable (whether discrete or continuous) is an interval scale, which is an ordinal scale characterized by a common and constant unit of measurement. Temperature (degrees F or degrees C) is an example of a measured variable with an interval scale. An even stronger measurement scale is the ratio scale, which is an interval scale with a true zero point. Weight is a variable measured on a ratio scale.

The length or range of a scale is determined by the definition of the classes. A dichotomous scale is said to have two points; one with r classes has r points. Theoretically, some numerical scales have infinite ranges, but in practice we usually limit the range by making a collective class at one or both ends, for example, pregnancies 0, 1, 2, 3, 4, 5 or more. For a variable such as hemoglobin, classes can be made for rare high and low hemoglobin values (such as less than 4.0 gm or more than 20.0 gm).
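A collective end class of this kind is easy to express in code. The following Python sketch is illustrative only: the function `pregnancy_class`, the cutoff argument `top`, and the ten patient counts are all assumptions made for the example. It maps each count onto the six-point scale 0, 1, 2, 3, 4, "5 or more" and tallies the frequencies.

```python
def pregnancy_class(count, top=5):
    """Return the class label for a count; counts >= top share one class."""
    if count < 0:
        raise ValueError("a count cannot be negative")
    return f"{top} or more" if count >= top else str(count)

# Hypothetical numbers of pregnancies for ten patients.
counts = [0, 2, 1, 7, 3, 5, 0, 1, 9, 2]

# A scale with six classes (six "points"), listed in rank order.
labels = [str(k) for k in range(5)] + ["5 or more"]
freqs = {label: 0 for label in labels}
for c in counts:
    freqs[pregnancy_class(c)] += 1

for label in labels:
    print(f"{label:>9}: {freqs[label]}")
```

Listing every class up front, including those with zero frequency, keeps the scale visible in the output even when a class happens to be empty in the sample.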

The first step in the art of statistics is the reduction of data to forms that are manageable without the loss of informative details. The observations must be arranged in distributions of each variable. The original records may be patient histories, interview sheets, or experimental laboratory diaries. A laborious first view may be provided on “master sheets” on which each unit (person, animal, experiment) is entered on a line and the measured or classed variables are entered in separate columns as numbers, letters, or signs. Looking down these columns, an idea of useful scales for each variable may be formed, and the arrangements for frequency counts can be made.

Counting +s, -s, ?s, etc., on the master sheet soon proves to be fraught with errors. A more satisfactory procedure is to put the data on cards that can be sorted and counted, electronically or by hand.

A card should contain all the essential measurements and identifications of the “unit” (the subject or the experimental animal). In small samples a blank 3- by 5-in. card will be found very satisfactory for frequency counts, but since most statistical studies will proceed to computation of averages, correlations, etc., punch cards that can be read, sorted, and further processed by electronic computers are preferable. Since the first edition of this book, such computers have become generally available at most medical centers; thus even an elementary text on medical statistics can be based on the assumption that electronic data processing is available.

Transfer of information to cards is a most important step for which the research investigator has the sole responsibility and to which he applies his insight in biomedical matters. Ideally, the resulting card is an expression of the intent and hypotheses of the research project.

A standard punch card (see Fig. 2-1) has 80 columns and 12 rows. In the punch machine the card moves from column 1 to 80 and as it presents each column the operator presses a typewriterlike key punch which inserts one or more holes in the appropriate row according to a standard mechanized code. Each decimal digit is a single punch in row 0, 1, 2, ... or 9. Each letter and arithmetic sign is represented by a combination of multiple punches in the same column. A set of adjacent columns is called a field and information is carried in one-column fields, two-column fields, etc. The investigator makes a code instruction or “card lay out” indicating which single or consecutive columns should be assigned for each variable available for the subject. Also, the instruction explains the coding of nominal classes by numerals. Table 2-1 is a modified code instruction for a study reported by C. M. Kunin and R. C. McCormack, New Eng. J. Med. 278:635-642, 1968, which dealt with bacteriuria and blood pressure among nuns and working women. Some more variables which were included originally are omitted here.

Table 2-1. Code Instruction: Bacteriuria and Blood Pressure in Women

Column  Width  Variable                  Range        Remarks
1-3     3      Identifying no.           001-999      As is
4       1      Occupation                0-5          0 Nun, 1 Nurse, 2 Teacher, 3 Clerical, 4 Factory, 5 Unskilled
5       1      Race                      0-1          0 White, 1 Negro
6       1      Bacteriuria               0-7          0 None, 1 E. coli, 2 Klebs., 3 Proteus, 4 Staph., 5 Yeast, 6 Others, 7 Not done
7       1      Glucose in urine          0-1          0 No, 1 Yes
8       1      Protein in urine          0-1          0 No, 1 Yes
9-10    2      Age                       00-99        Years as is
11-13   3      Height                    00.0-99.9    Inches in tenths
14-17   4      Weight                    000.0-999.9  Pounds in tenths
18-20   3      Arm circumference         00.0-99.9    Centimeters in tenths
21      1      Marital status            0-4          0 Never married, 1 Married, 2 Divorced, 3 Widowed, 4 Not known
22-24   3      Systolic blood pressure   000-999      mm Hg as is
25-27   3      Diastolic blood pressure  000-999      mm Hg as is

It will be noted that it is not necessary to introduce the decimal point itself in the punch card as long as the instruction explains the units that are used. Although letters can be punched they are not practical for further data processing.

A punch card from this study bearing the string of numbers (see Fig. 2-1) 098201004465512652421145085 can be read: “Study No. 98. 44-year-old married white teacher. Urine: no glucose, no protein, > 200,000 E. coli. Weight 126.5 lb., height 5 ft. 5½ in. Arm circumference 24.2 cm, blood pressure 145/85.”

The position of the punch is all important. The column is the sole means of identifying the value of a variable.
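Because the column alone identifies the variable, reading such a card amounts to slicing a fixed-width string. The Python sketch below is one possible reading of the code instruction in Table 2-1, not a program from the book: the field names, the `LAYOUT` and `CODES` structures, and the function `decode_card` are assumptions made for illustration. Only three nominal fields are translated into labels; the others are left as their numeric codes.

```python
# (field name, first column, last column, implied decimal places),
# transcribed from Table 2-1; the field names themselves are illustrative.
LAYOUT = [
    ("id", 1, 3, 0), ("occupation", 4, 4, 0), ("race", 5, 5, 0),
    ("bacteriuria", 6, 6, 0), ("glucose", 7, 7, 0), ("protein", 8, 8, 0),
    ("age_years", 9, 10, 0), ("height_in", 11, 13, 1),
    ("weight_lb", 14, 17, 1), ("arm_cm", 18, 20, 1),
    ("marital", 21, 21, 0), ("systolic", 22, 24, 0),
    ("diastolic", 25, 27, 0),
]

# Labels for three of the nominal fields, in code order.
CODES = {
    "occupation": ["Nun", "Nurse", "Teacher", "Clerical", "Factory",
                   "Unskilled"],
    "bacteriuria": ["None", "E. coli", "Klebs.", "Proteus", "Staph.",
                    "Yeast", "Others", "Not done"],
    "marital": ["Never married", "Married", "Divorced", "Widowed",
                "Not known"],
}

def decode_card(card):
    """Translate one punched string into a dict of named values."""
    if len(card) < LAYOUT[-1][2]:
        raise ValueError("card is shorter than the layout requires")
    record = {}
    for name, first, last, decimals in LAYOUT:
        value = int(card[first - 1:last])   # card columns are 1-based
        if decimals:
            value = value / 10 ** decimals  # restore the implied decimal point
        record[name] = CODES[name][value] if name in CODES else value
    return record

record = decode_card("098201004465512652421145085")
for name, value in record.items():
    print(f"{name:>12}: {value}")
```

The implied decimal point lives in the layout, not on the card, which is the same division of labor the code instruction describes.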

Sorting of cards to obtain frequencies simply consists of placing them in stacks for each class of a variable and counting the number. For example, one may start with two stacks, male and female. Each stack is again sorted in age groups, say, and again each age group sorted in some “yes-no” group. Hand sorting and counting is a task that must be meticulously performed to avoid misplacements and counting errors. A mechanical card sorter is a standard piece of data-processing equipment which sorts punch cards quickly and accurately.
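The successive sorting into stacks can be mimicked by counting records under a compound key. The following Python sketch uses six made-up cards and a hypothetical helper `age_group`; none of the values come from the study. Each distinct (sex, age group, finding) combination corresponds to one final stack.

```python
from collections import Counter

# Hypothetical cards: (sex, age in years, yes/no finding).
cards = [
    ("female", 34, "yes"), ("male", 51, "no"), ("female", 37, "no"),
    ("male", 58, "yes"), ("female", 31, "yes"), ("male", 52, "no"),
]

def age_group(age, width=10):
    """Label the age class, e.g. 34 -> '30-39' (illustrative helper)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

# Sort by sex, then age group, then finding, and count each stack.
stacks = Counter((sex, age_group(age), finding)
                 for sex, age, finding in cards)

for key in sorted(stacks):
    print(key, stacks[key])
```

Because every card lands in exactly one stack, the stack counts must add up to the number of cards, which gives a quick check against misplacements.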

The observed frequencies should be tabulated in a distribution. There are work tables and final tables, the latter to accompany printed or oral presentations. The work tables are numerous, voluminous, and detailed; they serve to facilitate statistical analysis after each variable distribution has been inspected.

Published tables should carry the information and demonstrate points discussed in the text. Although there are no hard-and-fast rules governing table construction, there are certain general principles that have become accepted as more or less standard. Many good scientific journals have editorial instructions with respect to tables that are worth checking before the tables are finally prepared.

1. The table should be as simple as possible. Two or three small tables are preferable to a single large table that contains many details or variables.

2. The table should be self-explanatory. For that purpose:

a. If codes, abbreviations, or symbols are used, these should be explained in detail in a footnote.

b. Each row and each column should be labeled concisely and clearly.

c. The specific units of measure for the data should be given.

d. The title should be clear, concise, and to the point. A good title will answer the questions: What? When? Where?

e. Totals should be shown. These may be given in the top row and the first column to the left of the table or in the bottom row and the last column to the right. The exact position chosen will depend on the relative importance of the totals to the body of the table as well as the number of groups or class intervals in the table. If the table is an unusually large one, totals are sometimes shown in both positions.

3. The title is commonly separated from the body of the table by lines or spaces. If the table is small, vertical lines separating the columns may not be necessary.

4. If the data are not original, their source should be given in a footnote.

The simplest form of a table is a listing of the classes of a variable scale and their observed frequencies. Because most statistical analysis involves comparison, such one-way tables are often replaced by cross tabulations of two or more variables. Editorial space limitations often force the author to publish only the most informative tables.

Table 2-2 is a two-way table of two nominal variables taken from the study outlined in Table 2-1. The two variables are (1) group (control women and nuns) and (2) species of organism (E. coli, Klebsiella, etc.). The table shows three nominal frequency distributions: one for the control women, one for nuns, and one for the combined groups.

Table 2-2. Distribution of Microorganisms Cultured from Two Populations of Females Found to Have Significant Bacteriuria Outside the Hospital

                  Control women        Nuns             Total
Organism          No.      %        No.      %       No.      %
E. coli            77     62.6       43     79.6     120     67.8
Klebsiella         34     27.6        5      9.2      39     22.0
Staphylococcus      7      5.7        2      3.7       9      5.1
Proteus             5      4.1        2      3.7       7      3.9
Yeast               0      0.0        1      1.9       1      0.6
Other               0      0.0        1      1.9       1      0.6
Total             123    100.0       54    100.0     177    100.0

Source: C. M. Kunin and R. C. McCormack, New Eng. J. Med. 278: 638, 1968.

The first “point” of this table is just to show the frequency of various species. The nominal-scale classes have been rearranged so that their rank order is that of the overall relative frequency. The second point is to show that the most frequent microbe differs somewhat in relative frequency between the two study groups. Hence, the absolute counts (frequency) in the subsets are converted to relative frequencies or percentages of the sample size. The sum of these should be entered as 100.0 in the “total” line. This helps the reader to understand clearly that the listed percentages are indeed relative frequencies.
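The conversion from counts to relative frequencies can be checked with a short Python sketch using the counts of Table 2-2. The dictionary layout and the function `relative_frequencies` are illustrative assumptions, not part of the original text. Percentages that are each rounded independently to one decimal need not add to exactly 100.0 and may differ in the last digit from a printed table; for instance, 5 of 54 rounds to 9.3, whereas Table 2-2 prints 9.2.

```python
# Counts from Table 2-2; the "total" distribution is derived, not retyped.
counts = {
    "Control women": {"E. coli": 77, "Klebsiella": 34, "Staphylococcus": 7,
                      "Proteus": 5, "Yeast": 0, "Other": 0},
    "Nuns": {"E. coli": 43, "Klebsiella": 5, "Staphylococcus": 2,
             "Proteus": 2, "Yeast": 1, "Other": 1},
}
counts["Total"] = {
    organism: counts["Control women"][organism] + counts["Nuns"][organism]
    for organism in counts["Control women"]
}

def relative_frequencies(freqs):
    """Convert absolute frequencies to percentages of the sample size."""
    n = sum(freqs.values())
    return {cls: 100.0 * f / n for cls, f in freqs.items()}

percent = {group: relative_frequencies(f) for group, f in counts.items()}

for group, pct in percent.items():
    n = sum(counts[group].values())
    print(f"{group} (n = {n})")
    for organism, p in pct.items():
        print(f"  {organism:<15}{counts[group][organism]:>5}{p:>7.1f}")
```

Keeping the unrounded percentages and rounding only when printing avoids compounding the rounding error if the values are reused later.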

A different use of percentages is shown in Table 2-3, which is a three-way display of age, study group, and presence of bacteriuria. Actually, this is a tabulation of 12 dichotomous distributions (bacteriuria “yes-no”), giving the sample sizes, the frequencies of “yes” only, and their relative frequencies. The bottom “total” line contains two such distributions, disregarding age. The line totals give six distributions without regard for specific study group.

Table 2-3. Frequency of Significant Bacteriuria (“Cases”) Among 3,304 Nuns and 2,698 Control Women, By Age

              Control women              Nuns
Age         No.    Cases     %       No.    Cases     %
15-24       613      33     5.4      960       4     0.4
25-34       742      32     4.3      768       2     0.3
35-44       598      30     5.0      484       7     1.4
45-54       495      19     3.8      385       6     1.6
55-64       219      14     6.4      310      10     3.2
≥65          31       2     6.5      397      23     5.8
Total      2698     130     4.8     3304      52     1.6

Source: C. M. Kunin and R. C. McCormack, New Eng. J. Med. 278: 638, 1968.

The tabular presentation of a frequency distribution or distributions is often supplemented by a graph; this can be an effective way of displaying the data, especially during oral presentation. When correctly drawn a graph allows the reader to obtain rapidly an overall grasp of the material presented. The relationship between numbers of various magnitudes can usually be seen more quickly and easily from a graph than from a table. There are many types of graphs but an understanding of a few general types will suffice for most ordinary medical data. The choice of the particular form of graph to be used is often a matter of personal preference. This is true also of many of the details of the graph itself. There are, however, certain general principles that are commonly accepted as being preferable. Some of the most important of these are:

1. The simplest type of graph consistent with the purpose is the most effective. No more lines or symbols should be used in a single graph than the eye can easily follow.

2. Every graph should be completely self-explanatory. Therefore, it should be correctly labeled as to its title, source, scales, and explanatory keys or legends.

3. The position of the title for a graph is one of personal choice. In published graphs, however, the title is commonly placed below the graph.

4. When more than one variable is shown on a graph, each should be clearly differentiated by means of legends or keys.

5. The diagram or graph generally proceeds from left to right and from bottom to top. All writing should be placed, therefore, so as to read from the bottom or from the right-hand side of the page.

6. No more coordinate lines should be shown than are necessary to guide the eye.

7. Scale lines should be drawn heavier than other coordinate lines.

8. The lines of the graph itself should be heavier than either coordinate or scale lines.

9. Frequency is generally represented on the vertical scale, with method of classification on the horizontal.

The simplest type of graph is the bar diagram. It is especially useful for characterizing frequency distributions of nominal variables, ordinal variables, and quantitative variables of discrete type. Figure 2-2 is a simple bar graph based on the total data for 177 women as listed in Table 2-2. The nominal variable, microorganism type, is shown divided into its six classes: E. coli, Klebsiella, etc. In this diagram the bars representing each type are drawn of equal width and with length proportional to their frequency. Therefore, bar area is proportional to length. Comparison of the length of the bars gives a visual picture of the frequency of occurrence of the different types; it shows, for example, that there were about thrice as many cases of E. coli as Klebsiella.
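As a rough stand-in for such a bar diagram, the Python sketch below prints one text bar per organism from the total column of Table 2-2. The function `bar_lengths`, the 60-character width, and the `#` symbol are arbitrary choices made for the illustration. Every bar starts at the zero line (the `|` character), and its length is proportional to the frequency, rounded to the nearest whole symbol.

```python
# Total frequencies for the 177 women, from Table 2-2.
totals = {"E. coli": 120, "Klebsiella": 39, "Staphylococcus": 9,
          "Proteus": 7, "Yeast": 1, "Other": 1}

def bar_lengths(freqs, width=60):
    """Scale frequencies so the largest bar is `width` symbols long.

    Integer arithmetic rounds halves upward, so no bar depends on
    floating-point rounding.
    """
    largest = max(freqs.values())
    return {cls: (f * width + largest // 2) // largest
            for cls, f in freqs.items()}

lengths = bar_lengths(totals)
for organism, f in totals.items():
    # The "|" marks zero; all bars are of equal width (one text row).
    print(f"{organism:<15}|{'#' * lengths[organism]} {f}")
```

With these settings the E. coli bar is 60 symbols long and the Klebsiella bar 20, which makes the roughly threefold difference visible at a glance.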

Bars may be drawn either horizontally or vertically. Regardless of the direction of the bars the scale line must start at zero or a wrong impression will result. Usually the diagram will be more attractive if the bars are wider than the spaces between them. Preferably the scale lines should be independent of the bars. Subclassification of the data may be shown by the use of multiple bars, in which case the diagram needs an appropriate legend.

When one is comparing two or more proportional distributions of either qualitative or quantitative data, in which the number of classes is relatively small, a form of graph known as proportional bar diagram is especially useful. Data from Table 2-2 are shown in this type of diagram in Figure 2-3. In this diagram a single bar, 100% in length, represents each distribution. Each bar is then divided into sections that correspond in length to the relative frequencies of the classes. That is, there are 77 E. coli cases among the 123 controls so the length of the section representing these cases is 62.6% of the control bar. It is easy to compare the lengths of the two dark solid sections and see that for the nuns proportionately more of the cases were E. coli than for the controls.

Figure 2-3. Proportional bar diagram (bars labeled CONTROLS and NUNS).

The histogram is a diagram used exclusively for showing frequency distributions of quantitative data that are continuous in nature. It is essentially an area diagram composed of adjacent rectangles. Hence, the areas used to represent the class frequencies when added together will give the composite area for the entire distribution.

Figure 2-4 shows a histogram for the age distribution of the 2,698 control women which was given in Table 2-3. The 10-year age intervals are represented by rectangles of equal width with heights proportional to the numbers of observations falling in the intervals. Thus the area above an age interval indicates the frequency of occurrence of ages in that interval. If the age intervals had not been equal the number of observations would have been equally distributed over the number of years for each interval in the histogram.

Figure 2-4. Histogram for the age distribution of 2,698 control women. (Horizontal axis: AGE.)

Figure 2-5. Illustrations of various types of frequency distributions. (Panels include HEIGHTS FOR ADULTS and FAMILY INCOME.)

The histogram is a presentation of the sample frequency distribution. As sample size increases and class intervals are shortened, the sample approaches the entire universe of values and the outline of the histogram becomes smooth. Distributions of various shapes are found. Many of these are characterized by the fact that they are symmetrical, have only one peak, and build up gradually from a fairly low number at the two extremes of the scale to a maximum in the middle. One distribution of this type is known as the normal distribution (Fig. 2-5 A).

Some distributions are asymmetrical or skewed. They may be skewed to the right or to the left, depending on whether the long tail of the distribution is on the right or left side. Two examples of this type are illustrated in Figures 2-5 B and D. Both of these are skewed to the right. Figure 2-5B has two peaks (bimodal) while Figure 2-5D is a more one-peak (or unimodal) type of distribution. Figure 2-5C shows another distribution that still has two peaks, one in early infancy and one in old age.

Applications of statistical techniques to biological data often assume, explicitly or implicitly, a particular mathematical formula for the distribution of the universe of values being sampled. This formula gives the relative frequency, or density, for a given measurement or scale class. Only three such distributions will be treated in this book: the normal, the binomial, and the Poisson. Data measured on a continuous scale are often compatible with the assumption that they were drawn from a normal distribution with its familiar bell-shaped frequency curve. This distribution is very briefly described in Chapter 3. Data measured on a discrete scale in which the observations take on only the values 0, 1, 2, . . . often follow the binomial or Poisson distributions (discussed in Chapter 7).

PROBLEM 2-1

For a group of 105 individuals the blood plasma potassium (expressed in milliequivalents per liter) ranged from 2.46 to 4.32. Assuming that you wish to classify these data into equal class intervals, set up the class-interval limits you would use. (See List of Answers to Problems, at end of book.)

PROBLEM 2-2

Assume that you are making a study of the relationship between the sodium in the red blood cell and in the blood plasma. You wish at the same time to study the relationship between these factors and age, race, and sex. Set up a code instruction sheet for transfer of data to 80-column punch cards.

PROBLEM 2-3

The following paragraph is from J.A.M.A. 153: 1505-1508, 1953.

Gastric resection was performed in 474 patients with benign gastric ulcer, and in 7 of these all the stomach was removed. Gastric resection was performed on 60 patients with malignant ulcers, and 9 total gastrectomies were performed in this group. In 23 patients with benign ulcers and 21 patients with malignant ulcers, other surgical procedures, including exploratory laparotomy, repair of perforation, gastroenterostomy, and ligation of a bleeding artery were used. There were 415 patients with benign gastric ulcers and 7 patients with malignant gastric ulcers who had medical treatment only.

1. Arrange these data in tabular form, with totals, title, and source reference.

2. Write a code instruction for data transfer to punch cards.

PROBLEM 2-4

In a serious epidemic of poliomyelitis in England and Wales in 1947, it was shown that the case fatality varied with the extent of paralysis. For persons with paralysis of limbs and/or trunk the case fatality was 5.8%; for those with paralysis of other parts of the body it was 32.8%; for those with no paralysis it was 2.5%. Show these data in the form of a bar diagram.

PROBLEM 2-5

The following table shows the deaths from cancer of the breast and cancer of the lung in the United States for the years 1939 to 1948.

Because of a change in classification by the National Office of Vital Statistics, data are not available for deaths from cancer of the lung after 1948. Deaths from cancer of the lung are grouped with deaths from cancer of the bronchus and trachea in the data for 1949 and all succeeding years. There were 20,909 deaths recorded in 1954 as due to cancer of the breast; 24,788 deaths recorded as due to cancer of the lung, bronchus, and trachea. Deaths from the latter two entities are relatively small in relation to those from cancer of the lung.

1. Compare graphically the change in the number of deaths from cancer of the breast with that of cancer of the lung by:

a. Line diagrams on arithmetic paper.

b. Line diagrams on arithmetic paper showing the deaths in each year expressed as a percentage of the deaths in 1939.

c. Line diagrams on arithlog paper.

Deaths in the United States from Cancer of the Breast and Cancer of the Lung, 1939-1948

          Cancer of breast              Cancer of lung
Year   No. of     % of deaths      No. of     % of deaths
       deaths     in 1939          deaths     in 1939

1939   14,868     100.0             5,120     100.0
1940   15,488     104.2             5,430     106.1
1941   15,526     104.4             6,025     117.7
1942   15,945     107.3             6,329     123.6
1943   16,140     108.6             7,088     138.4
1944   16,379     110.2             7,621     148.8
1945   17,133     115.2             8,162     159.4
1946   17,516     117.8             8,864     173.1
1947   18,030     121.3             9,571     186.9
1948   19,162     128.9            10,493     204.9

Source: Vital Statistics of the United States, 1939-1948.

2. Discuss the change in mortality from cancer of these two sites as shown by these three graphs.

3. Under what conditions would you prefer to use each of these types of graphs?

The Normal Distribution

The normal or Gaussian distribution has a symmetric frequency curve shaped like a bell. It is completely specified by two parameters called the mean and standard deviation, usually denoted $\mu$ and $\sigma$. The mean, $\mu$, indicates the center of the distribution; the standard deviation, $\sigma$, indicates the spread or variability of the distribution. The formula for the relative frequency curve is

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \]

where $\pi$ and $e$ are known mathematical constants. The probability of an observation, x, falling in the interval $x_1$ to $x_2$ is given as the area under the curve f(x) between $x_1$ and $x_2$. These areas (probabilities) have been extensively tabled (see Table A).*

* Table A appears on page 207. Similarly, keyed tables (through E) follow.

As an example of biologic data approximating the normal pattern, consider the distribution of 1,060 Tulane University School of Medicine freshman medical students’ pulse beats counted for 60 seconds at the end of an hour’s lecture (Table 3-1). Examination of this distribution shows that it is reasonably symmetrical with the frequency low at both ends of the distribution and a maximum in the pulse beat group of 75 to 78.

The frequencies increase in each class interval until the maximum is reached in the range from 75 to 78. Although these measurements are discrete measurements, their pattern is characteristic of the normal frequency distribution and is roughly approximated by it.

For the distribution of the sample of 1,060 students according to pulse beats the mean is 76.4 beats per minute and the standard deviation (S.D.) is 9.7 beats per minute. When these values are placed in the above equation, and the values of x corresponding to the midpoint of each class interval substituted, the corresponding values of f(x) can be determined. The curve, nf(x), has been superimposed on the histogram of the distribution (Fig. 3-1).

Table 3-1. Distribution of 1,060 Students by Resting Pulse Beat

Pulse beat               Frequency            Accumulated             Expected
per minute  Midpoint                          relative frequency      midpoint ordinate
                      Observed  Expected*    Observed   Expected     (nf(x))

43-46        44.5         1        0.5        0.0009     0.0005         0.8
47-50        48.5         2        1.6        0.0028     0.0020         2.8
51-54        52.5         6        5.1        0.0085     0.0069         8.4
55-58        56.5        22       14.0        0.0292     0.0201        21.3
59-62        60.5        52       32.3        0.0783     0.0506        45.5
63-66        64.5        79       62.9        0.1528     0.1099        82.2
67-70        68.5       118      103.6        0.2642     0.2077       125.2
71-74        72.5       165      144.3        0.4198     0.3439       160.9
75-78        76.5       186      170.0        0.5953     0.5042       174.4
79-82        80.5       165      169.2        0.7509     0.6638       159.5
83-86        84.5       103      142.5        0.8481     0.7983       123.0
87-90        88.5        82      101.4        0.9255     0.8940        80.1
91-94        92.5        45       61.1        0.9679     0.9516        43.9
95-98        96.5        19       31.1        0.9858     0.9809        20.3
99-102      100.5        11       13.4        0.9962     0.9935         7.9
103-106     104.5         3        4.9        0.9991     0.9981         2.6
107-110     108.5         1        1.5        1.0000     0.9995         0.7
110+                      0        0.6        1.0000     1.0000

Total                 1,060

Mean = 76.4. S.D. = 9.7.
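The substitution described above can be sketched in a few lines of Python. This is an illustration rather than the authors' procedure: the function names are hypothetical, and the class width of 4 beats is an assumption inferred from the class limits in Table 3-1 (it appears to be needed to put the curve on the same scale as the histogram frequencies).

```python
import math

def normal_density(x, mean, sd):
    """Relative frequency curve f(x) of the normal distribution."""
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return math.exp(exponent) / math.sqrt(2 * math.pi * sd ** 2)

def expected_ordinate(midpoint, n, class_width, mean, sd):
    """Height of the fitted curve on the frequency scale of the histogram.

    Assumption: the ordinate is n * class_width * f(midpoint), so that the
    curve and the histogram bars enclose comparable areas.
    """
    return n * class_width * normal_density(midpoint, mean, sd)

# Sample values quoted in the text: 1,060 students, mean 76.4, S.D. 9.7.
N_STUDENTS, CLASS_WIDTH, MEAN, SD = 1060, 4, 76.4, 9.7

for midpoint in (60.5, 76.5, 92.5):
    height = expected_ordinate(midpoint, N_STUDENTS, CLASS_WIDTH, MEAN, SD)
    print(f"midpoint {midpoint}: expected ordinate {height:.1f}")
```

Under these assumptions the computed ordinates fall close to the last column of Table 3-1; small differences in the final digit are to be expected because the mean and S.D. used here are rounded.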

Figure 3-1. Histogram of the observed distribution in Table 3-1 with superimposed normal distribution with mean 76.4 and S.D. 9.7.

This histogram of the actual distribution shows, on the whole, only fairly close agreement with the normal distribution, calculated for the sample of 1,060 individuals, with a mean pulse beat of 76.4 and a standard deviation of 9.7 beats. For a test of normality see Chapter 10 and Problem 10.1. The distribution as a whole has also been divided by erecting perpendicular lines or ordinates to the x axis at the mean of the distribution and at intervals of 1, 2, and 3 S.D.s on either side of the mean. These lines divide the area under both the curve and the histogram into six parts, obviously unequal in area. The area between any two ordinates, however, is approximately the same under the normal curve as under the histogram. The area under the curve, therefore, could be used in place of the area under the histogram and from this the number of observations falling between any two ordinates could be estimated.

It should be recognized that the normal curve drawn for a particular distribution will depend on the value of the mean, the standard deviation, and the number of observations in the distribution. Thus there will be an infinite number of normal curves. Methods of estimating the area of a rectangle or a circle are easy but the methods of computing the area under the normal curve between any two ordinates are not so simple. For that reason, areas for the standard normal curve with mean 0 and standard deviation of 1 have been calculated and tabled. In order that these areas may be applied to any specific curve, the x scale of the curve has been transformed into what is often referred to as a standard measure or relative deviate scale, x'. This means that any given value is expressed as a number of standard deviations from the mean. For example, in a curve with $\mu = 10$ and $\sigma = 2$ the value x = 8 would be represented as one standard deviation from the mean since $x' = (x - \mu)/\sigma = (8 - 10)/2 = -1$.

if ... skews to the right, and it is ... 0.5. The ... of the distribution depends on p, the underlying probability, and on whether we consider the rate scale, 0/n ... n/n, or the “success” scale, 0 ... n.

Mean of rates (a/n) = p

Mean of successes (a) = np

In Table 7-1, the variable x is set equal to a/n or a/6, showing the computed mean rate for each p. The standard deviation of a single observation that follows the binomial distribution with parameter p is $\sqrt{p(1 - p)} = \sqrt{pq}$. From the rule of standard errors of the mean for n observations we obtain:

Standard error of rate (a/n) = $\sqrt{pq/n}$

Standard error of success (a) = $\sqrt{npq}$

Table 7-1 gives also the results of the summation $\sum (x - \bar{x})^2 \cdot f(x)$, which is the variance of the rate distribution. It is easily seen that for each p, the variance is pq/n. Similarly, the table shows that the variance of “successes” is npq. That the distributions for p = 0.1 and for p = 0.9 have the same variance or standard error squared follows naturally from the identity, $np(1 - p) = n(1 - p)p$.

In summary, we find that the distribution of rates or proportions is defined by one parameter, the underlying probability, p, when the sample size, n, is given. The normal distribution needs two independent parameters, the mean and the standard deviation, for specification. Knowledge of the underlying distribution enables us to estimate the probability that an observed rate, a/n, is a sample of a universal distribution of a given or assumed p. We compute the probability that a or fewer events would occur, assuming the probability of a single occurrence to be p.
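The statement that the variance of the rate distribution equals pq/n can be checked numerically. The sketch below is an illustration only; the function name is hypothetical, and n = 6 is taken from the "a/6" mentioned in connection with Table 7-1.

```python
from math import comb

def binomial_rate_moments(n, p):
    """Mean and variance of the rate a/n, summed term by term.

    f(a) is the binomial frequency of a successes in n trials; the variance
    is the summation of (x - mean)^2 * f(x) with x = a/n.
    """
    q = 1 - p
    freqs = [comb(n, a) * p ** a * q ** (n - a) for a in range(n + 1)]
    mean = sum((a / n) * f for a, f in enumerate(freqs))
    variance = sum((a / n - mean) ** 2 * f for a, f in enumerate(freqs))
    return mean, variance

for p in (0.1, 0.5, 0.9):
    mean, variance = binomial_rate_moments(6, p)
    # The last two columns should agree: summed variance versus pq/n.
    print(p, round(mean, 4), round(variance, 6), round(p * (1 - p) / 6, 6))
```

The same summation on the "success" scale (using a instead of a/n) would give np and npq, which is why p = 0.1 and p = 0.9 share one variance.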

7. Binomial and Poisson Distributions

Two methods are used: the normal approximation and direct binomial expansion. The first is used for larger samples, the other for small samples with a few or no events.

NORMAL APPROXIMATION

Larger samples of the binomial distribution follow the central limit theorem, which states that the distribution of a approximates a normal distribution, with mean np and standard deviation $\sqrt{npq}$. This approximation is adequate with rather small n’s if p is 0.5. Larger n’s are necessary the more different p is from 0.5. Figures 7-2 and 7-3 compare the binomial and normal distribution for p = 0.5 and n = 10 and n = 40, respectively. In Figure 7-2, the normal distribution has mean 5 (np = 10 × 0.5) and standard deviation $\sqrt{2.5}$ (npq = 10 × 0.5 × 0.5); in Figure 7-3 the mean of the normal distribution is 20 and the standard deviation is $\sqrt{10}$. In general, the approximation to the normal distribution is sufficiently close if both np and nq are larger than 5.

Figure 7-2. Comparison of normal curve and “binomial curve” for $(1/2 + 1/2)^{10}$.

In estimating the probability that an observed proportion a/n is different from a theoretical or assumed p we use the normal variate x', which is equivalent to t with infinite degrees of freedom (marked $\infty$). x' is formed using the number of successes, a, and its mean and standard deviation based on p. Or, equivalently, a/n and its mean and standard deviation can be used.

\[ x' = t_\infty = \frac{a - np}{\sqrt{npq}} = \frac{a/n - p}{\sqrt{pq/n}} \]

Example. In the year 1953, 1,523 deaths in the United States were attributed to rheumatic fever. There were 774 males and 749 females, a proportion of 774/1,523 = 0.5082 males. Assuming that p (the proportion of males in the universe) is 0.5, one can ask if the observed number of males is unlikely. It follows that:

Mean = np = 1,523 × 0.5 = 761.5

Standard error = $\sqrt{npq} = \sqrt{1523 \times 0.5 \times 0.5} = 19.5$

\[ t_\infty = \frac{774 - 761.5}{19.5} = 0.64. \]

Since the observed t does not exceed 1.96, the 0.05 value (see bottom line of Table B), there is no evidence to reject the null hypothesis, which in this case amounts to stating that there is no sex preference in deaths caused by rheumatic fever.
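The rheumatic fever calculation can be reproduced with a short sketch. The function names are hypothetical; the two forms (counts and rates) are included only to show that they give the same deviate.

```python
import math

def normal_deviate_counts(a, n, p):
    """x' = (a - np) / sqrt(npq), formed on the scale of successes."""
    q = 1 - p
    return (a - n * p) / math.sqrt(n * p * q)

def normal_deviate_rates(a, n, p):
    """x' = (a/n - p) / sqrt(pq/n), formed on the rate scale."""
    q = 1 - p
    return (a / n - p) / math.sqrt(p * q / n)

# 774 male deaths among 1,523, tested against an assumed p of 0.5.
z_counts = normal_deviate_counts(774, 1523, 0.5)
z_rates = normal_deviate_rates(774, 1523, 0.5)
print(round(z_counts, 2), round(z_rates, 2))

# Compare with 1.96, the two-sided 0.05 value quoted in the text.
print("exceeds 1.96:", abs(z_counts) > 1.96)
```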

When p is not known but is to be estimated by a/n, 95% confidence limits for p are given by the formula

\[ \frac{a}{n} \pm 2\sqrt{\frac{(a/n)(1 - a/n)}{n}} \]

BINOMIAL EXPANSION

When p is very different from 0.5 and its distribution is heavily skewed, an approach through the normal distribution is inaccurate. For example, with a/n = 0.02 and sample size 150, the standard error is $\sqrt{0.02 \times 0.98/150} = 0.0114$. However, subtraction of 2 standard errors from a/n, or 0.02 − 0.0228 = −0.0028, leads to a negative value for the lower confidence limit of p, which is absurd.

Figure 7-3. Comparison of normal curve and “binomial curve” for $(1/2 + 1/2)^{40}$.

In such circumstances, the frequency of each event and confidence intervals can be found in tables.* Alternatively the formulae of this chapter can be used for direct calculation. Making these calculations is currently eased considerably by use of electronic computers, but often a desk calculator with sufficient digits and multiplication and division facilities can yield the result in shorter time than it takes to obtain “time” on a large electronic computer.

Returning to the question of the confidence limit of p when a/n = 3/150, one can first approach the problem by assuming the observed rate is the “true” probability and proceeding to find the frequency of 0, 1, 2, etc., occurrences out of 150. By adding each frequency in sequence the accumulated frequency distribution results, which tells us which number of events are outside the 95% limits of the distribution. Thus, with p = 0.02 and n = 150 we find:

Events (a)   Frequency f(a)                              Accumulated frequency

0            f(0) = q^150 = (0.98)^150 = 0.048296        0.048296
1            f(1) = f(0) 150p/q        = 0.147845        0.196141
2            f(2) = f(1) 149p/2q       = 0.224785        0.420926
3            f(3) = f(2) 148p/3q       = 0.226314        0.647240
4            f(4) = f(3) 147p/4q       = 0.169735        0.816975
5            f(5) = f(4) 146p/5q       = 0.101148        0.918123
6            f(6) = f(5) 145p/6q       = 0.049886        0.968009
7            f(7) = f(6) 144p/7q       = 0.020943        0.988952

* See, for example, the Handbook of Tables for Probability and Statistics, edited by William H. Beyer, Chemical Rubber Co., Cleveland, 1966.

Thus one can determine that events between 0 and 6 occur with 95% probability if p = .02. Repeating the calculation for other assumed “true” p's will lead to a confidence interval for p based on the observed a = 3.

The direct computation of probability is simplest when the observed rate is either 0/n or n/n. The probability is then $q^n$ or $p^n$, respectively.
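The step-by-step expansion for p = 0.02 and n = 150 lends itself to a short loop. The sketch below assumes the recurrence f(a+1) = f(a)(n − a)p / ((a + 1)q) implied by the successive factors 150p/q, 149p/2q, and so on; the function name is hypothetical.

```python
from itertools import accumulate

def binomial_frequencies(n, p, max_events):
    """Frequencies f(0)..f(max_events), built from f(0) = q**n by recurrence."""
    q = 1 - p
    freqs = [q ** n]
    for a in range(max_events):
        # Each term follows from the previous one: f(a+1) = f(a)(n - a)p / ((a + 1)q)
        freqs.append(freqs[-1] * (n - a) * p / ((a + 1) * q))
    return freqs

freqs = binomial_frequencies(150, 0.02, 7)
cumulative = list(accumulate(freqs))

for events, (f, c) in enumerate(zip(freqs, cumulative)):
    print(f"{events}  {f:.6f}  {c:.6f}")
```

Because this loop accumulates unrounded values, the last digit of a cumulative figure can differ by one unit from a table built by adding rounded frequencies.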

In the example given in Table 6-2, there were seven patients in whom improvement in breathing was attempted. Six patients did show improvement and one had the same measurement before and after treatment. Let the null hypothesis be that the probability of one patient’s condition staying arrested or being improved is 0.5 and the probability of his worsening is 0.5. Then the probability of seven patients’ not worsening is $(0.5)^7 = 1/128 = 0.0078$. The null hypothesis can then be rejected at the 0.01 level.

Accepting a treatment effect one can then ask “What is the least probability of improvement that contains 7/7 within its 95% limits?” This is expressed as follows

\[ p^7 \ge 0.05 \]

for which the solution is

\[ p \ge \sqrt[7]{0.05} = \operatorname{antilog}\left(\frac{\log 0.05}{7}\right) = \operatorname{antilog}(-0.18586) \]

\[ p \ge 0.65 \]

The answer is that there is evidence that the patient’s chance of improvement is at least 0.65, or that his “odds” in favor of improvement are 2/1 or better.

Chapter 16 will contain examples of utilization of the binomial distribution in estimating the sample size, n, for given p.
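The two calculations for the seven-patient example reduce to a power and a root. A minimal sketch, with a hypothetical function name and the 0.05 level taken from the text:

```python
def least_p_for_all_successes(n, alpha=0.05):
    """Smallest p for which n successes out of n still has probability >= alpha.

    Solves p**n >= alpha, i.e. p >= alpha ** (1 / n).
    """
    return alpha ** (1 / n)

# Probability that all seven patients fail to worsen when p = 0.5.
prob_all_seven = 0.5 ** 7
print(round(prob_all_seven, 4))

# Least probability of improvement compatible with 7/7 at the 0.05 level.
p_lower = least_p_for_all_successes(7)
print(round(p_lower, 2))

# The same bound expressed as odds in favor of improvement.
print(round(p_lower / (1 - p_lower), 1))
```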


THE POISSON DISTRIBUTION OR LAW OF SMALL NUMBERS

A frequently encountered circumstance is that the probability of an event is very small but that even if the sample size is large only a small number of events are observed. For example, when one is counting blood cells, millions of cells are drawn and diluted so that only a few cells out of the millions are likely to be counted in the microscope grid. Counts of cells per square in subdivisions of the grid follow a certain distribution called the Poisson distribution, so named after the French mathematician who first described it. The scale is an integer scale 0, 1, 2 . . . that theoretically ends with n, the large sample size, but in practice only low frequencies are observed.

The mean of the distribution is a noninteger which is identical to that of the binomial and usually called m:

\[ m = \frac{\sum a \times f(a)}{\sum f(a)} = np. \]

If p is 0.00001 and n = 200,000, the mean is 2.0. The standard error is also derived from that of the binomial distribution in a special case:

\[ \sigma = \sqrt{np(1 - p)}. \]

Since p is very small, 1 − p is almost identical to 1 and the formula becomes

\[ \sigma = \sqrt{np} = \sqrt{m}. \]

The standard error of the Poisson distribution equals the square root of its mean, or the mean and variance are equal. The expected frequency can be derived from the recurrence formula for the binomial. We had

\[ f(a + 1) = f(a) \, \frac{(n - a)p}{(a + 1)q}. \]

Since q is almost identical to 1 and a is very small in comparison to n, the formula reduces to

\[ f(a + 1) = f(a) \, \frac{np}{a + 1} = f(a) \, \frac{m}{a + 1}. \]

For the first subset, 0, we had $f(0) = (1 - p)^n$.

This reduces, approximately, for small p’s to

\[ f(0) = e^{-np} = e^{-m}, \]

where e is the base of the natural logarithm. It follows, then, that the Poisson frequency distribution is

\[ f(a) = \frac{e^{-m} m^a}{a!}. \]

The data presented in Table 7-2 were gathered from the Annual Report from the State Department of Public Health, Pennsylvania, in 1960. Multiple sclerosis is a rare disease for which the annual mortality is about 4 per 100,000 population. The 34 Pennsylvania counties whose population is between 15,000 and 90,000 reported 0, 1, and 2 deaths from multiple sclerosis in 1960. The number of counties so reporting were 18, 13, and 3, respectively. The mean of the distribution is then estimated to be, per county,

\[ m = (0 \times 18 + 1 \times 13 + 2 \times 3)/34 = 19/34 = 0.559. \]

Table 7-2. Distribution of Medium-Size Pennsylvania Counties by Number of Deaths from Multiple Sclerosis (1960)

No. of deaths   No. of counties   Expected no.

0                     18              19.4
1                     13              10.9
2                      3               3.0
3+                     0               0.7

Total                 34              34.0

m = 19/34 = 0.559.

We can then expect, if the Poisson distribution is valid, to find the following frequency:

0 deaths      $f(0) = 34 \, e^{-m} = 19.4$
1 death       $f(1) = 34 \, m \, e^{-m} = 10.9$
2 deaths      $f(2) = 34 \, m^2 e^{-m}/2 = 3.0$

and the remainder,

3 or more     $f(3+) = 34.0 - (19.4 + 10.9 + 3.0) = 0.7$
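The expected column of Table 7-2 can be reproduced from the Poisson formula f(a) = e^(−m) m^a / a!. The sketch below is illustrative; the variable and function names are hypothetical, and the county counts are those quoted in the text (18, 13, and 3 counties with 0, 1, and 2 deaths).

```python
import math

def poisson_frequency(a, m):
    """Poisson relative frequency f(a) = exp(-m) * m**a / a!."""
    return math.exp(-m) * m ** a / math.factorial(a)

# Observed number of counties by number of deaths.
counties_by_deaths = {0: 18, 1: 13, 2: 3}

n_counties = sum(counties_by_deaths.values())
total_deaths = sum(deaths * count for deaths, count in counties_by_deaths.items())
m = total_deaths / n_counties  # estimated mean deaths per county

# Expected number of counties with 0, 1, and 2 deaths.
expected = [n_counties * poisson_frequency(a, m) for a in range(3)]
# Whatever is left over belongs to the "3 or more" class.
remainder = n_counties - sum(expected)

print(round(m, 3))
print([round(e, 1) for e in expected], round(remainder, 1))
```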

of Table 8-1, taken from the Journal of the American Medical Association, 145:14, 1951.

Table 8-1. Results of Medical Management in 101 Cases of Massive Gastric Hemorrhage

Period     No. cases   No. deaths   % dying

1930-43       40           10         25.0
1944-49       61            5          8.2

Source: J. A. M. A., 145:14, 1951.

The authors state, “The chief factor, we believe, that will explain the decrease in the mortality rate from 25.0 to 8.2% is the more liberal use of transfusion.” The first question to be answered is: Is this difference between 25.0 and 8.2% of a magnitude that will occur frequently in samples coming from the same universe? Here the fatality rate of persons with gastric hemorrhage is not known, but if there is a common case-fatality rate, the best estimate of it is that based on the sum total of the data. In this case 15 or 14.8% of 101 persons with massive gastric hemorrhage died. Hence the estimate for the universe p is 0.148. The calculations necessary for the test of significance are therefore:

$a_1 = 10$, $n_1 = 40$; $a_2 = 5$, $n_2 = 61$; $A = 15$, $N = 101$

\[ \chi^2 = \frac{(|10 \times 61 - 40 \times 5| - 50.5)^2 \times 101}{40 \times 61 \times 15 \times 86} = \frac{13{,}053{,}265.25}{3{,}147{,}600} = 4.15 \]
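The arithmetic for Table 8-1 can be written as a small function. This is a sketch under the assumption that the statistic has the form used in the numerical working above, with N/2 (here 50.5) subtracted as a continuity correction; the function name and argument order are hypothetical.

```python
def chi_square_2x2_corrected(a1, n1, a2, n2):
    """Chi-square for two proportions a1/n1 and a2/n2 with continuity correction.

    Assumed form: (|a1*n2 - a2*n1| - N/2)**2 * N / (n1 * n2 * A * (N - A)),
    where A = a1 + a2 and N = n1 + n2.
    """
    total_events = a1 + a2
    total_n = n1 + n2
    numerator = (abs(a1 * n2 - a2 * n1) - total_n / 2) ** 2 * total_n
    denominator = n1 * n2 * total_events * (total_n - total_events)
    return numerator / denominator

# 10 deaths among 40 cases (1930-43) versus 5 deaths among 61 cases (1944-49).
chi_square = chi_square_2x2_corrected(10, 40, 5, 61)
print(round(chi_square, 2))
```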

(p

This analysis shows that

\[ P(0) = \frac{D(D - 1)(D - 2) \cdots (D - L + 1)}{N(N - 1)(N - 2) \cdots (N - L + 1)} \]

\[ P(r + 1) = P(r) \, \frac{(L - r)(M - r)}{(r + 1)(D - L + r + 1)} \]

Example. A most dramatic test of the efficacy of hyperimmune rabies serum (gamma globulin) was performed in Iran (Bull. World Health Organ., 13:747-772, 1955). Seventeen persons had been bitten in the head or neck by the same rabid wolf. The standard Pasteur vaccine treatment was given. In addition, 12 persons received one or more doses of antirabies gamma globulin. The results are summarized in Table 8-4.

Table 8-4. Deaths from Rabies with and without Gamma Globulin Added to Pasteur Vaccine Treatment

                          Persons bitten
                          by rabid wolf     Dead    Alive

With gamma globulin            12             1       11
Pasteur treatment only          5             3        2

Source: Bull. World Health Organ., 13:747-772, 1955.

Obviously the numbers are very small and the $\chi^2$ test would be of no validity. With Fisher’s exact test, the assumption is that only the occurrences that point to benefit of the new treatment (gamma globulin) are of interest. Are one or no deaths out of 12 an unlikely occurrence compared with 3 out of 5?

The smallest marginal total is 4 = L; M = 12 (the group in which 1/12 was observed); and D = 17 − 12 = 5.

\[ P(0) = \frac{2 \times 3 \times 4 \times 5}{14 \times 15 \times 16 \times 17} = \frac{1}{476} \]

\[ P(1) = P(0) \times \frac{4 \times 12}{1 \times 2} = \frac{24}{476} \]

\[ P(0) + P(1) = \frac{25}{476} = 0.053 \]
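The exact probabilities for the rabies example can be checked with rational arithmetic. The sketch below assumes the step-by-step form suggested by the worked figures: P(0) is a ratio of falling products, and each later term multiplies the previous one by (L − r)(M − r) / ((r + 1)(D − L + r + 1)), with D = N − M. The function name is hypothetical, and the code assumes L does not exceed D.

```python
from fractions import Fraction

def fisher_tail_terms(smallest_margin, group_size, total, r_max):
    """Exact probabilities P(0)..P(r_max) for a 2 x 2 table with fixed margins.

    smallest_margin (L): smallest marginal total, here the 4 deaths.
    group_size (M): size of the group of interest, here the 12 given globulin.
    total (N): all persons, here 17. Assumes L <= N - M.
    """
    other_group = total - group_size  # D in the text
    prob = Fraction(1)
    for i in range(smallest_margin):
        # P(0): every one of the L events falls in the other group.
        prob *= Fraction(other_group - i, total - i)
    terms = [prob]
    for r in range(r_max):
        prob = prob * (smallest_margin - r) * (group_size - r) / (
            (r + 1) * (other_group - smallest_margin + r + 1)
        )
        terms.append(prob)
    return terms

terms = fisher_tail_terms(4, 12, 17, 1)
tail = sum(terms)
print(terms[0], terms[1], tail, round(float(tail), 3))
```

Using `Fraction` keeps the terms exact, so the total can be compared directly with 25/476.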

Figure 9-6. Relationship between total circulating protein and surface area with regression line, y = 40.604 + 72.169x. (Horizontal axis: surface area in square meters.)

9. Correlation and Regression

the corresponding value of y is obtained. Then a value of x such as 2.4 is substituted for x and a second value of y is obtained. These two paired values for x and y, (1.6, 156.1) and (2.4, 213.8), are plotted on the diagram and the line connecting them is now the regression line.

What do the values of a and b indicate? The value of b indicates that for each square meter change in surface area of the body, the circulating protein increases (because b is positive) by 72.169 gm. The a value is the value of y when x is 0, that is, the intercept of the regression line and the y axis. This regression line, fitted to the 62 points of the scatter diagram, is pictured in Figure 9-6. There is considerable scatter of the dots about the line.

A measure of the variation of these points around the line can be determined. It is known as the standard error of estimate and is usually denoted by $s_{y.x}$. The standard error of estimate measures the agreement between the y values observed and those predicted by the regression equation from the observed values of x. If, as is usual, $y_i$ is an observation on the dependent variable, $x_i$ the associated observation on the independent variable, and $Y_i = a + bx_i$ is the value of the dependent variable predicted for $x_i$ by use of the regression equation, then the definition of $s_{y.x}$ is:

\[ s_{y.x} = \sqrt{\frac{\sum [y_i - (a + bx_i)]^2}{n - 2}} \]

and the associated degrees of freedom are n − 2. The term under the square root sign is also known as the residual variance. Computational formulae are:

\[ s_{y.x} = \sqrt{\frac{\sum y^2 - (\sum y)^2/n - [\sum xy - (\sum x)(\sum y)/n]^2 / [\sum x^2 - (\sum x)^2/n]}{n - 2}} = s_y \sqrt{[1 - r^2][(n - 1)/(n - 2)]}. \]

The sample statistics a, b, like the sample mean, follow normal distributions when the original sample of y’s is from a normal distribution. The standard error of b is given by the formula:

\[ s_b = \frac{s_{y.x}}{\sqrt{\sum x^2 - (\sum x)^2/n}}, \]

and this provides a simple test of the null hypothesis that the true slope, estimated by b, is zero. The quantity $t = b/s_b$ is compared to the cut-off points of the t distribution (Table B, p. 208). If the calculated value of t lies beyond either cut-off point, b is declared to be significantly different (statistically) from zero at the chosen probability level.

Other standard statistical tests based on the standard error of estimate are available. For instance, confidence limits can be computed for the y value predicted from an x value by the regression equation, and two or more regression lines can be tested for parallelism. For these additional tests the reader is referred to an intermediate statistics text such as the Snedecor and Cochran volume referred to earlier.
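The computational formulae for the slope, the standard error of estimate, and the standard error of the slope need only six sums. The sketch below is an illustration with hypothetical function and variable names; the numerical sums are those quoted in this chapter for the circulating protein example (n = 62).

```python
import math

def regression_from_sums(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Least squares line y = a + bx and its standard errors from raw sums."""
    sxx = sum_xx - sum_x ** 2 / n          # corrected sum of squares of x
    syy = sum_yy - sum_y ** 2 / n          # corrected sum of squares of y
    sxy = sum_xy - sum_x * sum_y / n       # corrected sum of products
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    residual_variance = (syy - sxy ** 2 / sxx) / (n - 2)
    s_yx = math.sqrt(residual_variance)    # standard error of estimate
    s_b = s_yx / math.sqrt(sxx)            # standard error of the slope
    return a, b, s_yx, s_b, b / s_b

# Sums for circulating protein (y) and body surface area (x).
a, b, s_yx, s_b, t = regression_from_sums(
    n=62, sum_x=117.76, sum_y=11016, sum_xx=225.4290,
    sum_yy=2000748, sum_xy=21050.38,
)
print(round(a, 2), round(b, 2), round(s_yx, 2), round(s_b, 2), round(t, 1))
```

The corrected sum of squares for x is a small difference between two large numbers, so the sums should be carried to as many digits as are available.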

For the example on circulating protein, y, and body surface area, x (Table 9-2), the standard error of estimate is

\[ s_{y.x} = \sqrt{\frac{2{,}000{,}748 - (11{,}016)^2/62 - [21{,}050.38 - (117.76)(11{,}016)/62]^2 / [225.4290 - (117.76)^2/62]}{60}} = 23.903 \]

with 60 degrees of freedom and the standard error of the slope is

\[ s_b = \frac{23.903}{\sqrt{225.4290 - (117.76)^2/62}} = 18.013. \]

Thus, to test the observed slope against zero, we calculate

\[ t = b/s_b = 72.1686/18.013 = 4.0. \]

The 5% cut-off points of the t distribution with 60 d.f. are t(0.05) = 2.000 and −t(0.05) = −2.000. Since the calculated value of t exceeds the upper 5% cut-off point, it is concluded that there is significant evidence that the slope of the true regression line is not zero (i.e., it is not horizontal) and a trend exists. (Notice that if b had been of the same magnitude but negative, it would also have been declared to be significantly different from zero because then the calculated t value would have fallen short of the lower cut-off point. Another way of stating the test is that for b to be significant at the chosen level the calculated value $t = b/s_b$ must exceed the tabulated cut-off point for t in absolute value, i.e., without regard to sign.)

Although the correlation coefficient, r, and the regression coefficient, b, are different statistics, the arithmetic values of the t for the significance tests of each are always identical. Note that the t value obtained for r in the previous example was t = 4.0, as was the above t value for b.

Complete data for calculation of a regression line involving only 10 points are given in Table 9-3. The two variables involved are age

and the 1965 United States death rate per 1,000 population.

Table 9-3. Death Rates (per 1,000 Population) by Age, United States, 1965

                 Coded age   Death     Log10 death rate
Age                 (x)      rate           (y)

40-44                0         3.7          0.57
45-49                1         5.8          0.76
50-54                2         9.1          0.96
55-59                3        13.9          1.14
60-64                4        20.6          1.31
65-69                5        31.7          1.50
70-74                6        45.5          1.66
75-79                7        68.1          1.83
80-84                8       106.8          2.03
85 and over          9       202.0          2.31

n = 10
$\sum x = 45$, $\bar{x} = 4.5$, $\sum x^2 = 285$
$\sum y = 14.07$, $\bar{y} = 1.407$, $\sum y^2 = 22.6513$
$\sum xy = 78.6400$

\[ b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(78.64) - 45(14.07)}{10(285) - (45)^2} = 0.1858 \]

\[ a = \bar{y} - b\bar{x} = 1.407 - 0.1858(4.5) = 0.571 \]

Source: Vital Statistics of the United States 1965, Vol. II, Mortality, Part A, U.S. Department of Health, Education and Welfare, Public Health Service (Washington, 1967).

The “y-shaped” curve of death rates plotted against age is complex. However, it has been observed that for age 40 and over there is a regular increase which can be well expressed by a linear regression of logarithmic death rate on age. The independent variable, x, is age expressed in five-year increments from 42.5; the dependent variable is the logarithm (base 10) of the death rate.

The scatter diagram is shown in Figure 9-7 with the fitted regression line drawn in. The necessary sums are given in Table 9-3, and the computational formulae for a and b yield the estimates shown in the table and the equation

log death rate = 0.571 + 0.186x, where x = (age - 42.5)/5.

The slope, b, represents the number of log units the death rate changes with a five-year increase in age, so that b/5 = 0.03715 gives the annual change in log death rate or, after converting to the death rate scale (with antilogs), the annual percentage change, c, in death rate:

$$c = 100[\text{antilog}(b/5) - 1] = 100[1.089 - 1] = 8.9\% \text{ per year.}$$
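The computation above can be retraced in a few lines. The following is a minimal sketch, not part of the original text; the function name `least_squares` is illustrative only, and the data are the coded ages and log death rates of Table 9-3.

```python
# Coded age x (0 = ages 40-44, ..., 9 = 85 and over) and log10 death rate y
# as printed in Table 9-3.
x = list(range(10))
y = [0.57, 0.76, 0.96, 1.14, 1.31, 1.50, 1.66, 1.83, 2.03, 2.31]


def least_squares(x, y):
    """Return intercept a and slope b using the computational formulae."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xx = sum(v * v for v in x)
    sum_xy = sum(u * v for u, v in zip(x, y))
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    a = sum_y / n - b * sum_x / n
    return a, b


a, b = least_squares(x, y)

# Annual percentage change in death rate: 100 * [antilog(b/5) - 1]
annual_pct = 100 * (10 ** (b / 5) - 1)

print(f"a = {a:.3f}, b = {b:.4f}, annual change = {annual_pct:.1f}%")
```

The division by 5 converts the slope per five-year step into a slope per year before taking the antilog.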

As for the previous example, the regression line is drawn by calculating two points and connecting them. For example, if x = 0, y = 0.571, then the death rate = antilog(y) = 3.73, and if x = 3, y = 1.128 and the death rate = antilog(y) = 13.44. As a check, the point ($\bar{x}$, $\bar{y}$) should also fall on the line drawn.

Figure 9-7. Linear regression of logarithmic death rate (United States, 1965) on age. [Vertical axis: death rate per 1,000.]

It is possible to minimize the x deviations instead of the y deviations by interchanging the roles of the x and y variables. This means using the former dependent variable as independent and vice versa, and represents a fundamental change in the formulation of the problem. The resulting line, x = A + By, is not identical with y = a + bx (except in the case of perfect correlation) because different assumptions have been made and different deviations minimized. The equation x = A + By is called the regression of x on y.
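To see that the two lines differ, both slopes can be computed from the same data. The sketch below reuses the Table 9-3 values; the helper name `slope` is an assumption made for illustration. If the two lines coincided, B would equal 1/b; the product bB equals the squared correlation coefficient, which is 1 only under perfect correlation.

```python
x = list(range(10))
y = [0.57, 0.76, 0.96, 1.14, 1.31, 1.50, 1.66, 1.83, 2.03, 2.31]


def slope(u, v):
    """Least-squares slope of v regressed on u."""
    n = len(u)
    sum_u, sum_v = sum(u), sum(v)
    sum_uv = sum(p * q for p, q in zip(u, v))
    sum_uu = sum(p * p for p in u)
    return (n * sum_uv - sum_u * sum_v) / (n * sum_uu - sum_u ** 2)


b = slope(x, y)  # regression of y on x: y = a + bx
B = slope(y, x)  # regression of x on y: x = A + By
A = sum(x) / len(x) - B * sum(y) / len(y)

r_squared = b * B  # product of the two slopes

print(f"b = {b:.4f}, B = {B:.4f}, 1/b = {1 / b:.4f}, r^2 = {r_squared:.4f}")
```

With these data the correlation is very high, so B and 1/b are close but not equal.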

PROBLEM 9-1

Murphy and Gardner (unpublished data, University of Pennsylvania) collected platelets from normal donors, labeled them with chromium-51, and injected them into patients who had no circulating platelets of their own because of acute leukemia or aplastic anemia. The yield of platelets circulating in the recipient (as percent of the injected amount) was determined by direct microscopic enumeration and isotope counting. The results of 15 double measurements are given below:

Enumeration (x)    Isotope count (y)
      98                 86
      65                 68
      78                 78
      60                 88
      55                 31
      26                 27
      49                 29
      35                 30
      62                 50
      41                 43
      45                 44
      76                 51
      20                 20
       2                  2
       1                  1

1. Find by the least squares method the parameters a and b in the equation y = a + bx.
2. Plot the measurements y against x and draw the computed line.
3. Determine the standard error $s_b$ of the slope.

With the t test, are the following hypotheses accepted or rejected at the 5% level?

4. Slope = 0.
5. Slope = 1. This was the investigators’ working hypothesis, i.e., that the two methods yield equal results.

(See List of Answers to Problems.)

PROBLEM 9-2

Could you calculate the correlation coefficient measuring the relationship between blood pressure (systolic) in white males and white females for the age group 50 to 54 years? If not, why?

Plasma Volume and Total Circulating Albumin for 58 Normal Males

              Plasma      Total circulating                  Plasma      Total circulating
Individual    volume      albumin              Individual    volume      albumin
number        in cc (x)   in gm (y)            number        in cc (x)   in gm (y)
    1          2,575         119                   30         2,790         133
    2          2,896         133                   31         3,007         153
    3          2,429         121                   32         1,972          91
    4          2,552         129                   33         2,525         116
    5          3,213         146                   34         4,082         204
    6          2,921         146                   35         2,326         118
    7          3,607         182                   36         2,371         112
    8          3,142         145                   37         2,832         118
    9          2,524         116                   38         3,170         144
   10          2,599         118                   39         2,244         102
   11          2,900         136                   40         1,908          87
   12          2,802         143                   41         2,333          98
   13          2,508         139                   42         2,946         126
   14          2,642         144                   43         2,723         122
   15          3,219         163                   44         3,555         169
   16          3,307         146                   45         3,114         153
   17          2,210         100                   46         2,635         133
   18          2,817         130                   47         2,646         115
   19          2,711         118                   48         2,330         105
   20          2,191         106                   49         2,065          87
   21          2,818         141                   50         2,540         117
   22          3,187         153                   51         3,145         147
   23          2,726         144                   52         2,242          94
   24          1,989         106                   53         2,570         116
   25          2,526         131                   54         2,700         120
   26          2,491         120                   55         2,815         129
   27          2,366         124                   56         3,130         139
   28          2,950         143                   57         3,070         133
   29          2,048          93                   58         2,206          97

For these data:
$\sum x = 156,868$        $\sum y = 7,413$
$\sum x^2 = 434,974,920$  $\sum y^2 = 977,981$
$\sum xy = 20,579,638$

PROBLEM 9-3

Could you calculate the correlation coefficient measuring the relationship between systolic and diastolic blood pressure in white males of age 50 to 54? How does this problem differ from that expressed in Problem 9-2?

PROBLEM 9-4

For 58 normal males, plasma volume and circulating albumin were determined (Ann. Surg., 727:352, 1945). These measurements are given in the accompanying table.

1. Plot these data in a scatter diagram.
2. Is there evidence of a relationship between circulating albumin and plasma volume? If there is, is it linear in character? Positive or negative?
3. Calculate the value of the correlation coefficient showing the relationship between plasma volume and circulating albumin.
4. Is the correlation coefficient significant?
5. Find the equation of the regression line that would be used to predict circulating albumin from plasma volume.
6. Calculate the standard error of estimate for this regression line.

10. Goodness of Fit

“Goodness of fit” is a generic term for tests that determine if the observed frequencies in a distribution agree with those that would be expected according to some hypothesis.

The test compares the observed whole numbers ($O_i$) with expected numbers ($E_i$) that because of computational circumstances are not necessarily integers. It is assumed that the observations are classified into k mutually exclusive and exhaustive categories so that i = 1, 2, ..., k. The underlying formula is:

$$\chi^2 = \sum_i (O_i - E_i)^2 / E_i,$$

which has degrees of freedom (d.f.) where d.f. = k - 1 - l; l is the number of parameters estimated from the data in calculating the $E_i$.

The basic element is actually a standard measure squared. The assumption is that of the Poisson distribution. The whole numbers $O_i$ are assumed to be samples of a distribution with mean $E_i$ and Poisson standard error $\sqrt{E_i}$ (see Chapter 7). $(O_i - E_i)/\sqrt{E_i}$ is, therefore, a standard measure, and $(O_i - E_i)^2/E_i$ is approximately distributed as $\chi^2_1$. The summation of these $\chi^2$'s follows the distribution of $\chi^2$ with a number of degrees of freedom that is given by the formula above.

The test is used commonly by the geneticist in determining how closely an observed distribution follows an expected genetic pattern. It is used frequently to determine whether a particular distribution is sufficiently like a normal distribution so that the mean, standard deviation, and table of areas of the normal curve may be used in describing it.

As an example, consider in Table 10-1 the distribution of 200 patients with evidence of gastric ulcer that was later confirmed by x-ray examination, according to total acid content of the stomach following a stimulating dose of histamine. The mean, $\bar{x}$, is 100.4, and the standard deviation, $s_x$, is 22.41.

Table 10-1. Distribution of 200 Patients with Gastric Ulcer Diagnosed According to Total Acid Content of Stomach Following Stimulating Dose of Histamine

Units of total acid    Observed (O)    Expected (E)    O - E    (O - E)²/E
Under 60                     6              7.2         -1.2       0.200
60-70                       12             10.2          1.8       0.318
70-80                       17             18.9         -1.9       0.191
80-90                       27             28.3         -1.3       0.060
90-100                      36             33.8          2.2       0.143
100-110                     34             34.9         -0.9       0.023
110-120                     28             28.3         -0.3       0.003
120-130                     23             19.7          3.3       0.553
130-140                     13             11.0          2.0       0.364
140 and over                 4              7.7         -3.7       1.778
Total                      200            200.0          0.0       3.633

Source: Unpublished data from H. Bancroft.
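The last column of Table 10-1 can be recomputed directly from the observed and expected columns. The following is a minimal sketch using only the printed values; the variable names are illustrative.

```python
# Observed and expected frequencies as printed in Table 10-1.
observed = [6, 12, 17, 27, 36, 34, 28, 23, 13, 4]
expected = [7.2, 10.2, 18.9, 28.3, 33.8, 34.9, 28.3, 19.7, 11.0, 7.7]

# Contribution of each class: (O - E)^2 / E
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi_square = sum(contributions)

for o, e, c in zip(observed, expected, contributions):
    print(f"{o:4d} {e:6.1f} {o - e:6.1f} {c:7.3f}")
print(f"chi-square = {chi_square:.3f}")
```

Small differences in the third decimal can arise because the printed contributions were rounded individually before summation.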

The expected values are computed by determining the proportion of the area of the normal curve that falls between the beginning of successive class intervals and then multiplying these proportions in each group by n, the size of the sample. For example, to determine how many fall between 70 and 80 units of acid, the value $x' = (70 - \bar{x})/s_x$ is calculated and the area at x' is found from a detailed table of areas of the normal curve. Interpolation in Table A (p. 207) will give satisfactory but slightly different results. In this instance, (70 - 100.4)/22.41 = -1.36. The proportion of the area below this x' is 0.0869. In the same way, (80 - 100.4)/22.41 = -0.91. The area for this value of x' is 0.1814. The area between 70 and 80 is then 0.1814 - 0.0869 or 0.0945. Since there are 200 in the group, there would be 200 × 0.0945 or 18.9 persons with total acid between 70 and 80 units. Expected frequencies in other class intervals are determined by the same method.
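The table lookup described above can be imitated with the error function from the standard library. This is a sketch under one stated assumption: the standard measure is rounded to two decimals before the area is taken, as in the worked example; the function names are illustrative, not from the text.

```python
import math

MEAN, SD, N = 100.4, 22.41, 200


def normal_area_below(z):
    """Proportion of the area of the standard normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_between(lower, upper):
    """Expected number of the N patients with values between lower and upper."""
    # Assumption: round the standard measure to two decimals, mirroring
    # the use of a printed table of areas.
    z_lower = round((lower - MEAN) / SD, 2)
    z_upper = round((upper - MEAN) / SD, 2)
    return N * (normal_area_below(z_upper) - normal_area_below(z_lower))


e_70_80 = expected_between(70, 80)
print(f"expected between 70 and 80 units: {e_70_80:.1f}")
```

Without the rounding step the result is slightly smaller (about 18.8), which illustrates the remark that different tables give satisfactory but slightly different results.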

Examination of Table 10-1 shows some difference between the observed and expected frequencies in every interval. The $\chi^2$ test is used to determine whether the observed frequency distribution is sufficiently different from the expected to reject the hypothesis that the sample comes from a normal distribution. The value of $\chi^2$ is calculated by using the formula

$$\chi^2 = \sum (O - E)^2 / E.$$

To determine the degrees of freedom here, the following is considered. There are 10 pairs of frequencies for comparison in the analysis (k = 10), and two parameters were estimated, $\bar{x}$ and $s_x$; hence, d.f. = 10 - 1 - 2 = 7. The calculated $\chi^2_7 = 3.63$ does not exceed the 5% cut-off point, 14.067, and the fit is considered to be adequate. If the associated probability had been low (e.g., less than 5%), this would have been evidence of a poor fit.

As an example of fitting genetic theory, consider Table 10-2. The genetic genotypes of gamma globulin Gm(a) and Gm(b) are codominant so that phenotypes aa, ab, and bb can be identified. The two genes appear with different frequency in the population. Let the frequency of Gm(a) be p and that of Gm(b) be q so that p + q = 1. In a population with random mating the probabilities of one person possessing phenotypes aa, ab, or bb are $p^2$, $2pq$, and $q^2$, respectively. The frequency of gene patterns in married couples is similarly derived from the expansion of $(p^2 + 2pq + q^2)^2$. The derived probabilities are given in the second column of Table 10-2.

In order to assess the gene frequency in the sample with maximum likelihood, one considers the total of 4n or 996 genes. Of these, Gm(a) occurs as many times as the letter a occurs in the mating pattern. For example, in the pattern ab × bb it occurs once, but four times in the pattern aa × aa. To estimate p, one adds the products of gene occurrence and observed pattern frequency and divides the sum by 4n. Thus the estimate of p is

p = (102 + 58 + 24 + 16 + 30)/996 = 0.2309,

and that of q is

q = (306 + 58 + 8 + 364 + 30)/996 = 0.7691.

Note that p + q = 1, so that a separate estimate of q is unnecessary when p is estimated, except for purposes of checking one’s arithmetic.
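The gene count can be written out as a short sketch. The pairing of observed counts with mating patterns follows the arithmetic in the text; counting the letters in a pattern string is merely a convenient device assumed here, not something the source prescribes.

```python
# Observed number of couples for each mating pattern (phenotypes).
observed = {
    "ab x bb": 102,
    "ab x ab": 29,
    "aa x ab": 8,
    "aa x aa": 4,
    "aa x bb": 15,
    "bb x bb": 91,
}

n = sum(observed.values())  # number of couples
total_genes = 4 * n         # each couple carries four genes

# Gm(a) occurs as many times as the letter "a" occurs in the pattern
# (the separator "x" contains neither letter).
a_genes = sum(pattern.count("a") * count for pattern, count in observed.items())
b_genes = sum(pattern.count("b") * count for pattern, count in observed.items())

p = a_genes / total_genes
q = b_genes / total_genes

print(f"n = {n}, p = {p:.4f}, q = {q:.4f}")
```

Computing q separately, as here, serves only as the arithmetic check mentioned in the text.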

Table 10-2. Distribution of Phenotypes of Gamma Globulin Factors Gm(a) and Gm(b) in a Random Sample of 249 Married Couples

Mating          Probability of      Observed          Expected frequency
(phenotypes)    occurrence (f_i)    frequency (O_i)   (E_i = n × f_i)       (O_i - E_i)²/E_i
ab × bb         4pq³                    102               104.6                 0.065
ab × ab         4p²q²                    29                31.4                 0.183
aa × ab         4p³q                      8 }  12           9.5 }  10.2         0.317
aa × aa         p⁴                        4 }               0.7 }
aa × bb         2p²q²                    15                15.7                 0.031
bb × bb         q⁴                       91                87.1                 0.175
Total           1                       249               249.0           χ² = 0.771

d.f. = 5 - 1 - 1 = 3; P > 5%

Estimate of gene frequency of Gm(a):
p = [(102) + 2(29) + 3(8) + 4(4) + 2(15)]/(4(249)) = 230/996 = 0.2309
q = 1 - p = 0.7691

Source: A. G. Steinberg in D. W. Clark & B. MacMahon, Preventive Medicine, Little, Brown and Company, Boston, 1967, p. 107.

The expected frequencies are then computed as $E_i = n f_i$; in the first row, $f_1$ is $4pq^3$ and $E_1 = (249)4(0.2309)(0.7691)^3 = 104.6$. The sum of the $E_i$ should equal the sample size, n. Before proceeding to compute $\chi^2$, the expectancies $E_i$ (in the fourth column of Table 10-2) should be inspected for low values. The approximation to the $\chi^2$-distribution is not good if any expectancies are less than 5.0, and it is unacceptable with expectancies less than 1.0.

In this example, the pattern aa × aa is expected to occur 0.7 times out of 249. The usual circumvention of this dilemma is to combine groups of small expectancies until the sum of expectancies is reasonably high. In this example, patterns aa × aa and ab × aa are added to yield a mutual expectancy of 9.5 + 0.7 = 10.2. The effect is a loss in degrees of freedom. Originally, there were six subsets in the distribution. With the collapsing of two groups into one, there are only five subsets. The final degrees of freedom are then 5 - 1 - 1 = 3, because only one parameter had to be estimated from the data. The $\chi^2$ is found to be 0.771, which is much smaller than the 5% cut-off point for 3 degrees of freedom (7.82). It can therefore be concluded that the sample distribution of couples fits the genetic hypothesis of a random distribution of a population with the stated gene frequencies.
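The steps of this example (expected frequencies, pooling of the small expectancy, and the test statistic) can be sketched as follows. The dictionary layout and variable names are assumptions for illustration; because the expectancies are not rounded here, the statistic comes out near 0.80 rather than the printed 0.771, a difference that does not affect the conclusion.

```python
p, q, n = 0.2309, 0.7691, 249

# Probabilities from the expansion of (p^2 + 2pq + q^2)^2.
probabilities = {
    "ab x bb": 4 * p * q ** 3,
    "ab x ab": 4 * p ** 2 * q ** 2,
    "aa x ab": 4 * p ** 3 * q,
    "aa x aa": p ** 4,
    "aa x bb": 2 * p ** 2 * q ** 2,
    "bb x bb": q ** 4,
}
observed = {
    "ab x bb": 102, "ab x ab": 29, "aa x ab": 8,
    "aa x aa": 4, "aa x bb": 15, "bb x bb": 91,
}
expected = {k: n * f for k, f in probabilities.items()}

# Pool the pattern with an expectancy below 1.0 into a neighbouring group,
# as done in the text (aa x aa is combined with aa x ab).
pooled_obs = dict(observed)
pooled_exp = dict(expected)
pooled_obs["aa x ab"] += pooled_obs.pop("aa x aa")
pooled_exp["aa x ab"] += pooled_exp.pop("aa x aa")

chi_square = sum(
    (pooled_obs[k] - pooled_exp[k]) ** 2 / pooled_exp[k] for k in pooled_obs
)
# k - 1 - l, with one estimated parameter (p)
degrees_of_freedom = len(pooled_obs) - 1 - 1

print(f"chi-square = {chi_square:.3f} on {degrees_of_freedom} d.f.")
```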

m × n CONTINGENCY TABLES

An m × n contingency table presents the frequencies of joint occurrence of two attributes that each have more than two subsets so that there are m columns and n rows of frequencies. Totals for each row and for each column are given. The usual null hypothesis tested is that each row has the same distribution as the row forming the total of columns or, equivalently, that the distribution in each column is the same as that in the column for row total.

We have already considered significance tests for differences of two or several rates (see Chapter 8). These problems can be viewed as 2 × 2 or 2 × n contingency tables having two columns for a dichotomous classification and two or more rows for groups. The following test for an m × n contingency table is equivalent to the (uncorrected) tests previously presented for the special case when m = 2.

Let the frequency in the ith row and jth column be $O_{ij}$, and let the sum of frequencies in the ith row be $r_i$, and the sum of frequencies in the jth column be $c_j$. Finally, let the sum of all frequencies be

$$N = \sum_{i=1}^{n} r_i = \sum_{j=1}^{m} c_j.$$

The expected frequency for the observed $O_{ij}$ is then

$$E_{ij} = c_j \times r_i/N = r_i \times c_j/N.$$

We test the agreement between the observed and expected frequency by

$$\chi^2 = \sum\sum (O_{ij} - E_{ij})^2 / E_{ij}.$$

In order to compute expectancies, two sets of parameters are needed: the two sets of marginal probabilities, $r_i/N$ and $c_j/N$. In each set one probability is given when the others are computed, so that the total number of independently estimated parameters is (m - 1) + (n - 1). The degrees of freedom are then calculated by the usual formula, given m × n groups:

d.f. = m × n - 1 - [(m - 1) + (n - 1)].

The right side term can be simplified to read, for the special case of an m × n contingency table, as follows:

d.f. = (m - 1)(n - 1).

For example, consider the data of Table 10-3. Stage of gestation is classified into three subclasses, as is blood loss, making a total of nine compartments.

Table 10-3. Hemorrhage in Premature Separation of the Placenta, by Amount of Blood Lost and by Stage of Gestation

Observed frequencies ($O_{ij}$)
Total blood loss (ml)      Stage of gestation             Total     $r_i/N$
1,000                        6       23       22            51       0.228

Expected frequencies
Total blood loss (ml)      Stage of gestation             Total
1,000                        7.5     22.5     21.0          51.0
Total for columns ($c_j$)   33.0     99.0     92.0         224.0
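The rule $E_{ij} = r_i \times c_j/N$ and the degrees-of-freedom formula can be sketched as below. The function names are illustrative. Only the marginal totals legible in Table 10-3 are used for the worked row; the small 2 × 3 table at the end is hypothetical and serves only to exercise the general function.

```python
def expected_frequencies(table):
    """Return E_ij = r_i * c_j / N for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_square_contingency(table):
    """Return the uncorrected chi-square and its degrees of freedom."""
    expected = expected_frequencies(table)
    chi_square = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
    return chi_square, degrees_of_freedom


# Marginal totals taken from Table 10-3: one row total, the three column
# totals, and the grand total N.
row_total, col_totals, N = 51, [33, 99, 92], 224
row_expected = [row_total * c / N for c in col_totals]
print([round(e, 1) for e in row_expected])

# Hypothetical table whose rows have identical distributions, so the
# statistic should vanish.
toy_chi_square, toy_df = chi_square_contingency([[10, 20, 30], [20, 40, 60]])
print(toy_chi_square, toy_df)
```

The unrounded third expectancy is about 20.9; the printed 21.0 keeps the row of expectancies summing to 51.0.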


Sensitivity = TP/D(+).
Specificity = TN/D(-).
Prevalence of disease = D(+)/n.
Positivity of procedure = P(+)/n.

Sensitivity is the proportion of true positive among all diseased. Specificity is the proportion of true negative among nondiseased.

In order that a procedure can at all be of diagnostic value for a given disease, one must demand that sensitivity + specificity > 1.0. If this sum is approaching 2.0, the test is ideal. If a screening test is one with sensitivity approaching 1.0, it means that disease can practically be excluded among persons with a negative test. Effort for definite diagnosis can then be concentrated on persons with positive tests. The yield of true positives will depend on the specificity of the procedure and prevalence of disease. The tuberculin skin test, for example, has high sensitivity, but the positive test includes those with past infection, present symptom-free infection, partially developing disease, and active clinical tuberculosis.

In comparing two diagnostic procedures in the same disease, sensitivity and specificity are important measurements. The statistical test that evaluates the significance of observed differences in sensitivity and specificity between two procedures performed in the same patients is a special case of the $\chi^2$ test. For example, in an article in Surgery (64: 332-338, 1968), Sigel et al. compare clinical diagnoses with the Doppler ultrasound method in the diagnosis of lower-extremity venous occlusion. Venography was used to confirm or reject the diagnosis. Naturally the clinical diagnosis was assessed without knowledge of the outcome of the Doppler test or of venography. Table 13-3 presents the findings in 44 extremities with venous occlusion and in 77 extremities for which venographic examination showed normal conditions.

Table 13-3. Lower Extremities With or Without Venous Occlusion: Distribution by Results of Clinical Diagnosis and of Doppler Test

Clinical          Doppler       Venous          Normal
diagnosis (1)     test (2)      occlusion       veins
     +               +          22 (S1S2)        4 (F1F2)
     +               0           3 (S1F2)       27 (F1S2)
     0               +          16 (F1S2)        5 (S1F2)
     0               0           3 (F1F2)       41 (S1S2)
Total                           44              77

                  Sensitivity               Specificity
Clinical:         (3 + 22)/44 = 0.57.       (5 + 41)/77 = 0.60.
Doppler:          (16 + 22)/44 = 0.86.      (27 + 41)/77 = 0.88.

Data from B. Sigel et al., Surgery 64:332-338, 1968.

The usefulness of each procedure is shown by the fact that the sum of sensitivity and specificity exceeds 1.0. A statistical test for validity of the procedure is performed in each as the usual $\chi^2$ test for 2 × 2 tables (see Chapter 8). For the clinical diagnosis

$$\chi^2 = \frac{(25 \times 77 - 44 \times 31 - 60.5)^2 \times 121}{44 \times 77 \times 56 \times 65} = 2.46 \quad (P > 5\%).$$
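The figures for the clinical diagnosis can be retraced with a short sketch. The cell counts are those of Table 13-3; the function name `corrected_chi_square` and the dictionary layout are assumptions for illustration. The statistic is written in the equivalent form $N(|ad - bc| - N/2)^2$ divided by the product of the four marginal totals, which gives the same numerator as the expression above.

```python
# Counts keyed by (clinical result, Doppler result), "+" positive, "0" negative.
occlusion = {("+", "+"): 22, ("+", "0"): 3, ("0", "+"): 16, ("0", "0"): 3}
normal = {("+", "+"): 4, ("+", "0"): 27, ("0", "+"): 5, ("0", "0"): 41}

n_occlusion = sum(occlusion.values())
n_normal = sum(normal.values())

# Clinical diagnosis: true positives among diseased, true negatives among normal.
clinical_tp = sum(v for (clin, _), v in occlusion.items() if clin == "+")
clinical_tn = sum(v for (clin, _), v in normal.items() if clin == "0")
clinical_sensitivity = clinical_tp / n_occlusion
clinical_specificity = clinical_tn / n_normal


def corrected_chi_square(a, b, c, d):
    """Chi-square with continuity correction for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Rows: occlusion, normal; columns: clinical positive, clinical negative.
chi_square = corrected_chi_square(
    clinical_tp, n_occlusion - clinical_tp,
    n_normal - clinical_tn, clinical_tn,
)

print(f"sensitivity = {clinical_sensitivity:.2f}, "
      f"specificity = {clinical_specificity:.2f}, chi-square = {chi_square:.2f}")
```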