Bancroft’s Introduction to Biostatistics
SECOND EDITION
JOHANNES IPSEN, M.D., M.P.H.
Professor of Medical Statistics and Epidemiology, Department of Community Medicine, School of Medicine, University of Pennsylvania, Philadelphia
POLLY FEIGL, PH.D.
Assistant Professor of Biostatistics, Department of Biostatistics, School of Public Health and Community Medicine, University of Washington, Seattle
Medical Department
HARPER & ROW, PUBLISHERS
New York, Evanston, and London
BANCROFT’S INTRODUCTION TO BIOSTATISTICS
Copyright © 1957, 1970 by Harper & Row, Publishers, Inc. All rights reserved. No part of this book may be used or reproduced in any manner whatsoever without written permission except in the case of brief quotations embodied in critical articles and reviews. Printed in the United States of America. For information address Medical Department, Harper & Row, Publishers, Inc., 49 East 33rd Street, New York, N.Y. 10016.
FIRST EDITION
LIBRARY OF CONGRESS CATALOG CARD NUMBER: 74-106338
Contents
Preface to the Second Edition
Preface to the First Edition
1 Introduction
2 Distributions
3 The Normal Distribution
4 The Mean and Other Measures of Central Tendency
5 The Standard Deviation and Other Measures of Variation
6 Significance of Differences in Means
7 The Binomial and Poisson Distributions
8 Significance of Differences in Proportions
9 Correlation and Regression
10 Goodness of Fit
11 Medical and Vital Statistics
12 Comparison of Rate Tables
13 Evaluation of Effectivity and Risk
14 Techniques in Follow-Up Studies
15 Bioassay
16 Design of Experiment
Appendix on Computational Methods
List of Tables
Answers to Problems
Index
Preface to the Second Edition
In revising this textbook 12 years after its first appearance, we have chosen to follow the outline and approach of the First Edition as closely as possible. We have tried to retain the spirit of the original with its emphasis on application of biostatistics to medical problems. The attention to small sample procedures and computational instruction has been increased. The t test is presented early in the text; the F test is introduced; and accurate computational formulae are stressed. Growing familiarity with statistics, as evidenced by the medical literature in the last decade, has motivated the inclusion of discussions of relative risk, measures of effectiveness, and design of experiments.
We especially admire the late Dr. Bancroft’s choice of pertinent clinical examples and although some have been updated, the bulk of her problems have been kept. A list of computational answers has been added. Like Dr. Bancroft, we hope the book will be useful for self-study as well as in the classroom.
We are indebted to the Literary Executor of the late Sir Ronald A. Fisher, F.R.S., to Dr. Frank Yates, F.R.S., and to Oliver & Boyd Ltd., Edinburgh, for permission to reprint Table C from their book Statistical Tables for Biological, Agricultural and Medical Research. Also we wish to thank W. G. Cochran, G. M. Cox, and John Wiley & Sons, Inc., New York, for permission to reprint Table E from their book Experimental Designs.
Philadelphia, Pennsylvania
Johannes Ipsen
Polly Feigl
Preface to the First Edition
The present textbook represents the third revision of a series of mimeographed notes prepared for use in the teaching of biostatistics to sophomore medical students at Tulane University. The apparent usefulness to, and acceptance by, the students here during the past 9 years has led the author to agree to its publication. As originally written it has been used in a course covering approximately 48 hours, of which 15 are spent in lectures, the balance in supervised laboratory work.
The text deals mainly with statistical methods appropriate for large samples. Frequency distributions, tabulation, graphing, use of centering constants and measures of variation, and other descriptive methods are treated in the introductory chapters. Further topics include the binomial, χ², and the use of the normal distribution in large sample approximate tests of significance for differences in sample means and proportions. Additional chapters deal with the t tests and vital statistics (including the use of the modified life table techniques in follow-up studies), and the final chapter presents a brief discussion of quantal bioassay and the Reed-Muench method.
Because the book was written for medical students and practicing physicians, who frequently have little training in mathematics, the presentation is in simple, nontechnical terms. No knowledge of mathematics beyond elementary high-school algebra is required for understanding. Since the book will be used primarily by the medical profession, illustrations in the text and examples at the end of each chapter have been taken almost entirely from clinical medicine.
The author wishes to acknowledge her gratitude to Dr. John Fertig, Dr. Alan Treloar, and Dr. R. A. Fisher and their publishers for permission to print the tables dealing with size of sample, the χ², and t distribution, respectively. Dr. Lila Elveback has given many helpful suggestions in the preparation of the manuscript. To Miss Ethel Eaton credit should be given for her painstaking preparation of the illustrations as well as for many of the computations. To Mary Grace Kelleher, Patricia Spaid, and Arva Boesch thanks are due for the careful typing of the manuscript.
New Orleans, Louisiana
Huldah Bancroft
1 Introduction
A growing emphasis on the role of quantitative methods in medicine makes it imperative that the student of medical science have some knowledge of statistics.
The medical student while in school is taught the best method of diagnosis and therapy. After graduation he must of necessity depend on current literature to learn new methods of therapy, diagnosis, and prevention. Thus he must be able to evaluate for himself the results of other workers. He must decide when a new technique or method shall supplement or replace an older one. He must be able to answer a mother’s question about new preventive measures with as much surety as he now advises her regarding vaccination against smallpox and polio. He should be able to give the family intelligent assurance of the prognosis for a given patient. Such prognosis may depend on his ability adequately to appraise laboratory findings as well as on his knowledge of the relation of age, sex, and other conditions of the patient to a particular disease. New knowledge regarding facts such as these will come to the physician through research work done by himself or by others. He must, therefore, be able to select from masses of information that which is of high caliber and which will pass rigid scientific tests. He must develop a healthy skepticism toward everything he reads.
Just how can a knowledge of statistical techniques assist him in this problem? First and foremost, he must recognize that individuals vary not only from each other but also within themselves from day to day. A certain amount of variation is normal, but the question facing the physician is that of determining when a specific variation becomes pathologic. To assess this, the student must learn how variation in normal individuals is measured and what the range of normal variation is. He must learn that there is some error present in every measurement or count made. It is highly unlikely that two successive red blood cell counts made on the same specimen of blood will be identical. When, therefore, does a difference become greater than error of measurement? For example, a patient had a red blood cell count of 4.3 million cells yesterday. The laboratory today reports 4.2 million cells. Does this indicate that the count is decreasing? Another patient is admitted with a white blood cell count of 6,000. Two hours later the laboratory reports a count of 6,200 on the same patient. Does this indicate a real rise in white count? In other words, can these differences be explained by the inaccuracy inherent in the method of counting blood cells? To treat his patient with the utmost skill the physician must know the answers to questions such as these. For every measurement or determination provided by the laboratory, the physician should know the variation that is a part of the method itself; otherwise he will not know when a given variation represents a real change in a patient.
Whenever new methods of diagnosis or therapy are introduced, the question to be answered is whether the specific new method under consideration is superior to the old method. Critical evaluation of the experimental study must be made. Questions that must be answered are:
1. Were controls used in measurement of results? If so, were they well selected? Were the experimental group and the control group as nearly alike as possible with regard to all known factors that would affect the outcome? Were all the factors that differentiated the group identified and evaluated?
2. Was the difference in the results obtained greater than could be attributed to chance or normal variation?
Only when these questions have been answered can conclusions regarding a new method be drawn. It must be recognized that there are no statistical techniques available that will prove that one treatment is in all respects better than another. They will, however, give the probability of a given difference occurring if there is no real difference in treatment so that the reader may conclude whether in a particular instance the difference is significant.
More generally, critical assessment of medical information often requires knowledge of basic statistical concepts. Some of the most fundamental of them are presented in the following chapters. An attempt is made to present concepts with a minimum of technical terminology and mathematical proof. The science of medical statistics cannot be covered in an elementary text, in part because of the increasing complexity of the medical sciences. The medical investigator, unlike his basic-science colleagues, is restricted by regard for his patient in his attempt to make scientifically sound clinical observations. He has difficulty studying one or two variables alone in a strictly designed investigation. Any realistic study involves many variables, the analysis of which may require a variety of statistical approaches and techniques.
This book may enable the reader to perform simple statistical tests, but the primary objective is to give him insight into statistical thought. Assuredly, most sophisticated data treatment will, in the modern research setting, be undertaken by a professional statistician, but the medical man should be on speaking terms with this worker so that he can both advise him in problem formulation and interpret the results of the statistical analyses.
2 Distributions
It has been said that all knowledge arises through some process of observation. This observation may be as simple as noting whether an event occurs or does not occur, such as the fact that an individual wears glasses or does not wear glasses. On the other hand, the observation may involve a more complicated procedure, such as the measurement of the amount of hemoglobin in 100 cc of blood. The characteristic being observed or measured is called a variable. The variable takes on a value for each subject (or other experimental unit) under observation. Typically these observed values differ from subject to subject, thus providing the variability which is the raw material of statistics.
Statistics is the art and science of numerical distributions. In this text we are dealing mostly with sample distributions, that is, fractions of total populations or universes. If the samples can be assumed to represent the population, characteristics of the sample distribution can be generalized to hold for the total population. For example, the distribution of differential counts of white blood cells in a sample of 40 healthy adult men may be used to describe that distribution in the total population of healthy adult men.
Regardless of the type of variable, it must be observed, recorded, and transcribed for arrangement in a distribution. The “anatomy” of a distribution of a variable is mainly as follows:
1. The sample size (n) is the number of observations of the variable
2. The sample space (x) is the set consisting of all possible values of the observations
3. The classes (x_i) are mutually exclusive and exhaustive subsets of the sample space
4. The frequency (f_i) with which an observation occurs in class x_i is the number of observations in that class
It follows that the sum of frequencies f_i must equal n, or else a miscount has occurred or the scale has insufficient classes to account for all observations.
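The anatomy listed above can be sketched in a few lines of Python (a language the book itself does not use). The observations and class names are hypothetical, chosen only to show how the frequencies f_i are tabulated and checked against the sample size n.

```python
from collections import Counter

# Hypothetical observations of a nominal variable (not data from the book).
observations = ["male", "female", "female", "unknown", "male", "female"]

# Classes x_i: mutually exclusive and exhaustive subsets of the sample space.
classes = ["male", "female", "unknown"]

n = len(observations)                                   # sample size n
counts = Counter(observations)
frequencies = {c: counts.get(c, 0) for c in classes}    # f_i for each class

# Observations that fit no class signal a scale with insufficient classes.
unclassified = [x for x in observations if x not in classes]


def frequencies_account_for_all(freqs, size):
    """Return True when the class frequencies sum to the sample size."""
    return sum(freqs.values()) == size


print(frequencies)
print(frequencies_account_for_all(frequencies, n))
```

If the check fails, either a miscount has occurred or some observation fell outside every class, which the `unclassified` list would reveal.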
The kind of scale to be used depends on the nature of the variable under observation. Enumerative (i.e., count) data are on nominal or ordinal scales. A nominal scale has merely descriptive classes, for example:

Variable          Class
Sex               Male, female, unknown
Cause of death    Heart disease, cancer, stroke, all others

The common yes-no scale is called dichotomous. Unfortunately many scales intended as dichotomous end up with an additional class: “not known” or some title of that nature.
An important nominal distribution is diagnoses of diseases and causes of death. International agreement over almost a century has produced a three-digit code which places all diseases in some category that does not necessarily follow a strict anatomical or etiological system, but which is most useful for comparative studies.
Ordinal or rank scales also have descriptive classes, but they are arranged in order of intensity. For example: Clinical outcome: dead, unchanged, better, recovered; or Tissue reaction: 0, +, ++, +++. Many such scales, although seemingly subjective, are very useful in the hands of expert clinicians and research scientists.
A measured variable is one whose observed values are recorded on a numerical scale. Such a measured (or quantitative) variable can be classified as being either discrete or continuous. The former implies a restriction of the possible outcome to isolated values, for example, Number of pregnancies: 0, 1, .... A variable is continuous if, conceptually, it could take on any value in certain intervals. Continuous scales are used for measurements when fractions are real and meaningful, such as 12.1 gm hemoglobin per 100 cc, or 5.3 mg uric acid. In practice, the distinction between the discrete and continuous numerical scales is not very important. Averages of discrete scales may be expressed in fractions, such as 1.45 pregnancies as a collective experience of a group of female patients. Further, a continuous scale may often, for ease of presentation, appear with subsets of whole numbers (e.g., height 62, 63, ... in.).
The desirable scale for a measured variable (whether discrete or continuous) is an interval scale, which is an ordinal scale characterized by a common and constant unit of measurement. Temperature (degrees F or degrees C) is an example of a measured variable with an interval scale. An even stronger measurement scale is the ratio scale, which is an interval scale with a true zero point. Weight is a variable measured on a ratio scale.
The length or range of a scale is determined by the definition of the classes. A dichotomous scale is said to have two points; one with r classes has r points. Theoretically, some numerical scales have infinite ranges, but in practice we usually limit the range by making a collective class at one or both ends, for example, pregnancies 0, 1, 2, 3, 4, 5 or more. For a variable such as hemoglobin, classes can be made for rare high and low hemoglobin values (such as less than 4.0 gm or more than 20.0 gm).
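Limiting the range of a scale with a collective class at the upper end might look as follows in Python. This is only a sketch: the patient counts are hypothetical, and the function name `classify_pregnancies` is introduced here for illustration.

```python
from collections import Counter


def classify_pregnancies(count, top=5):
    """Map a pregnancy count to its class label, pooling values >= top."""
    if count < 0:
        raise ValueError("a count cannot be negative")
    return f"{top} or more" if count >= top else str(count)


# Hypothetical counts for nine patients (not data from the book).
pregnancies = [0, 2, 1, 7, 5, 3, 0, 4, 9]
distribution = Counter(classify_pregnancies(p) for p in pregnancies)

# Print the classes in scale order, including any empty class.
labels = [str(k) for k in range(5)] + ["5 or more"]
for label in labels:
    print(f"{label:>9}: {distribution.get(label, 0)}")
```

The scale then has six points regardless of how large the rare high counts become.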
The first step in the art of statistics is the reduction of data to forms that are manageable without the loss of informative details. The observations must be arranged in distributions of each variable. The original records may be patient histories, interview sheets, or experimental laboratory diaries. A laborious first view may be provided on “master sheets” on which each unit (person, animal, experiment) is entered on a line and the measured or classed variables are entered in separate columns as numbers, letters, or signs. Looking down these columns, an idea of useful scales for each variable may be formed, and the arrangements for frequency counts can be made. Counting +s, -s, ?s, etc., on the master sheet soon proves to be fraught with errors. A more satisfactory procedure is to put the data on cards that can be sorted and counted, electronically or by hand.
A card should contain all the essential measurements and identifications of the “unit” (the subject or the experimental animal). In small samples a blank 3- by 5-in. card will be found very satisfactory for frequency counts, but since most statistical studies will proceed to computation of averages, correlations, etc., punch cards that can be read, sorted, and further processed by electronic computers are preferable. Since the first edition of this book, such computers have become generally available at most medical centers; thus even an elementary text on medical statistics can be based on the assumption that electronic data processing is available.
Transfer of information to cards is a most important step for which the research investigator has the sole responsibility and to which he applies his insight in biomedical matters. Ideally, the resulting card is an expression of the intent and hypotheses of the research project.
A standard punch card (see Fig. 2-1) has 80 columns and 12 rows. In the punch machine the card moves from column 1 to 80 and as it presents each column the operator presses a typewriterlike key punch which inserts one or more holes in the appropriate row according to a standard mechanized code. Each decimal digit is a single punch in row 0, 1, 2, ... or 9. Each letter and arithmetic sign is represented by a combination of multiple punches in the same column. A set of adjacent columns is called a field and information is carried in one-column fields, two-column fields, etc. The investigator makes a code instruction or “card lay out” indicating which single or consecutive columns should be assigned for each variable available for the subject. Also, the instruction explains the coding of nominal classes by numerals. Table 2-1 is a modified code instruction for a study reported by C. M. Kunin and R. C. McCormack, New Eng. J. Med. 278:636-612, 1968, which dealt with bacteriuria and blood pressure among nuns and working women. Some more variables which were included originally are omitted here.
Table 2-1. Code Instruction: Bacteriuria and Blood Pressure in Women

Column   Width   Variable                   Range         Remarks
1-3      3       Identifying no.            001-999       As is
4        1       Occupation                 0-5           0 Nun; 1 Nurse; 2 Teacher; 3 Clerical; 4 Factory; 5 Unskilled
5        1       Race                       0-1           0 White; 1 Negro
6        1       Bacteriuria                0-7           0 None; 1 E. coli; 2 Klebs.; 3 Proteus; 4 Staph.; 5 Yeast; 6 Others; 7 Not done
7        1       Glucose in urine           0-1           0 No; 1 Yes
8        1       Protein in urine           0-1           0 No; 1 Yes
9-10     2       Age                        00-99         Years as is
11-13    3       Height                     00.0-99.9     Inches in tenths
14-17    4       Weight                     000.0-999.9   Pounds in tenths
18-20    3       Arm circumference          00.0-99.9     Centimeters in tenths
21       1       Marital status             0-4           0 Never married; 1 Married; 2 Divorced; 3 Widowed; 4 Not known
22-24    3       Systolic blood pressure    000-999       mm Hg as is
25-27    3       Diastolic blood pressure   000-999       mm Hg as is
It will be noted that it is not necessary to introduce the decimal point itself in the punch card as long as the instruction explains the units that are used. Although letters can be punched they are not practical for further data processing.
A punch card from this study bearing the string of numbers (see Fig. 2-1) 098201004465512652421145085 can be read: “Study No. 98. 44-year-old married white teacher. Urine: no glucose, no protein, > 200,000 E. coli. Weight 126.5 lb., height 5 ft. 5½ in. Arm circumference 24.2 cm, blood pressure 145/85.”
The position of the punch is all important. The column is the sole means of identifying the value of a variable.
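Because the column alone identifies each variable, reading a card amounts to slicing a digit string by position. The following Python sketch decodes the card quoted above using the column assignments of Table 2-1; the field names (`height_in`, `arm_cm`, and so on) are labels invented for this illustration, and the book of course predates any such program.

```python
# Field layout from Table 2-1: (name, first column, last column), 1-based.
LAYOUT = [
    ("id", 1, 3), ("occupation", 4, 4), ("race", 5, 5),
    ("bacteriuria", 6, 6), ("glucose", 7, 7), ("protein", 8, 8),
    ("age", 9, 10), ("height_in", 11, 13), ("weight_lb", 14, 17),
    ("arm_cm", 18, 20), ("marital", 21, 21),
    ("systolic", 22, 24), ("diastolic", 25, 27),
]
# Fields punched in tenths, with the decimal point implied by the instruction.
TENTHS = {"height_in", "weight_lb", "arm_cm"}


def decode_card(card):
    """Split a punched digit string into named fields by column position."""
    if len(card) < 27 or not card[:27].isdigit():
        raise ValueError("expected at least 27 digit columns")
    record = {}
    for name, first, last in LAYOUT:
        value = int(card[first - 1:last])
        record[name] = value / 10 if name in TENTHS else value
    return record


record = decode_card("098201004465512652421145085")
print(record)
```

Occupation 2, race 0, bacteriuria 1, and marital status 1 correspond to teacher, white, E. coli, and married in the code instruction, in agreement with the reading given in the text.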
Sorting of cards to obtain frequencies simply consists of placing them in stacks for each class of a variable and counting the number. For example, one may start with two stacks, male and female. Each stack is again sorted in age groups, say, and again each age group sorted in some “yes-no” group. Hand sorting and counting is a task that must be meticulously performed to avoid misplacements and counting errors. A mechanical card sorter is a standard piece of data-processing equipment which sorts punch cards quickly and accurately.
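The successive sorting described above (sex, then age group, then a yes-no item) is equivalent to counting cards by a combined key. A minimal Python sketch, with hypothetical cards and an assumed 10-year age grouping, could read:

```python
from collections import Counter

# Hypothetical cards: (sex, age in years, answer to a yes-no item).
cards = [
    ("female", 34, "yes"), ("male", 52, "no"), ("female", 47, "no"),
    ("male", 29, "yes"), ("female", 31, "yes"), ("male", 58, "no"),
]


def age_group(age, width=10):
    """Return the label of the equal-width age class containing `age`."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


# Each key plays the role of one stack after three successive sorts.
stacks = Counter((sex, age_group(age), answer) for sex, age, answer in cards)

for key in sorted(stacks):
    print(key, stacks[key])
```

The stack counts must again sum to the number of cards, which gives a simple check against misplacements.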
The observed frequencies should be tabulated in a distribution. There are work tables and final tables, the latter to accompany printed or oral presentations. The work tables are numerous, voluminous, and detailed; they serve to facilitate statistical analysis after each variable distribution has been inspected.
Published tables should carry the information and demonstrate points discussed in the text. Although there are no hard-and-fast rules governing good scientific table construction, there are certain general principles that have become accepted as more or less standard. Many journals have editorial instructions with respect to tables that are worth checking before the tables are finally prepared.
1. The table should be as simple as possible. Two or three small tables are preferable to a single large table that contains many details or variables.
2. The table should be self-explanatory. For that purpose:
a. If codes, abbreviations, or symbols are used, these should be explained in detail in a footnote.
b. Each row and each column should be labeled concisely and clearly.
c. The specific units of measure for the data should be given.
d. The title should be clear, concise, and to the point. A good title will answer the questions: What? When? Where?
e. Totals should be shown. These may be given in the top row and the first column to the left of the table or in the bottom row and the last column to the right. The exact position chosen will depend on the relative importance of the totals to the body of the table as well as the number of groups or class intervals in the table. If the table is an unusually large one, totals are sometimes shown in both positions.
3. The title is commonly separated from the body of the table by lines or spaces. If the table is small, vertical lines separating the columns may not be necessary.
4. If the data are not original, their source should be given in a footnote.
The simplest form of a table is a listing of the classes of a variable scale and their observed frequencies. Because most statistical analysis involves comparison, such one-way tables are often replaced by cross tabulations of two or more variables. Editorial space limitations often force the author to publish only the most informative tables.
Table 2-2 is a two-way table of two nominal variables taken from the study outlined in Table 2-1. The two variables are (1) group (control women and nuns) and (2) species of organism (E. coli, Klebsiella, etc.). The table shows three nominal frequency distributions: one for the control women, one for nuns, and one for the combined groups. The first “point” of this table is just to show the frequency of various species. The nominal-scale classes have been rearranged so that their rank order is that of the overall relative frequency. The second point is to show that the most frequent microbe differs somewhat in relative frequency between the two study groups. Hence, the absolute counts (frequency) in the subsets are converted to relative frequencies or percentages of the sample size. The sum of these should be entered as 100.0 in the “total” line. This helps the reader to understand clearly that the listed percentages are indeed relative frequencies.

Table 2-2. Distribution of Microorganisms Cultured from Two Populations of Females Found to Have Significant Bacteriuria Outside the Hospital

                  Control women        Nuns              Total
Organism          No.      %           No.     %         No.      %
E. coli            77     62.6          43    79.6       120     67.8
Klebsiella         34     27.6           5     9.2        39     22.0
Staphylococcus      7      5.7           2     3.7         9      5.1
Proteus             5      4.1           2     3.7         7      3.9
Yeast               0      0.0           1     1.9         1      0.6
Other               0      0.0           1     1.9         1      0.6
Total             123    100.0          54   100.0       177    100.0

Source: C. M. Kunin and R. C. McCormack, New Eng. J. Med. 278: 638, 1968.
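The conversion of absolute counts to percentages of the sample size can be sketched in Python from the counts of Table 2-2. The function name `percentages` is introduced here for illustration only.

```python
# Absolute counts from Table 2-2: (control women, nuns).
counts = {
    "E. coli": (77, 43),
    "Klebsiella": (34, 5),
    "Staphylococcus": (7, 2),
    "Proteus": (5, 2),
    "Yeast": (0, 1),
    "Other": (0, 1),
}


def percentages(frequencies):
    """Convert absolute frequencies to percentages of their total."""
    total = sum(frequencies)
    return [100.0 * f / total for f in frequencies]


controls = percentages([c for c, _ in counts.values()])
nuns = percentages([n for _, n in counts.values()])

for organism, pc, pn in zip(counts, controls, nuns):
    print(f"{organism:<15}{pc:6.1f}{pn:6.1f}")
print(f"{'Total':<15}{sum(controls):6.1f}{sum(nuns):6.1f}")
```

Rounding each percentage independently can leave a column a tenth away from 100.0. Printed tables are sometimes adjusted in one entry for that reason, which may be why Table 2-2 shows 9.2 where direct rounding of 5/54 gives 9.3.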
A different use of percentages is shown in Table 2-3, which is a three-way display of age, study group, and presence of bacteriuria. Actually, this is a tabulation of 12 dichotomous distributions (bacteriuria “yes-no”), giving the sample sizes, the frequencies of “yes” only, and their relative frequencies. The bottom “total” line contains two such distributions, disregarding age. The line totals give six distributions without regard for specific study group.

Table 2-3. Frequency of Significant Bacteriuria (“Cases”) Among 3,304 Nuns and 2,698 Control Women, By Age

             Control women               Nuns
Age          No.    Cases     %          No.    Cases     %
15-24        613      33     5.4         960       4     0.4
25-34        742      32     4.3         768       2     0.3
35-44        598      30     5.0         484       7     1.4
45-54        495      19     3.8         385       6     1.6
55-64        219      14     6.4         310      10     3.2
≥65           31       2     6.5         397      23     5.6
Total       2698     130     4.8        3304      52     1.6

Source: C. M. Kunin and R. C. McCormack, New Eng. J. Med. 278: 638, 1968.

The tabular presentation of a frequency distribution or distributions is often supplemented by a graph; this can be an effective way of displaying the data, especially during oral presentation. When drawn correctly a graph allows the reader to obtain rapidly an overall grasp of the material presented. The relationship between numbers of various magnitudes can usually be seen more quickly and easily from a graph than from a table. There are many types of graphs but an understanding of a few general types will suffice for most ordinary medical data. The choice of the particular form of graph to be used is often a matter of personal preference. This is true also of many of the details of the graph itself. There are, however, certain general principles that are commonly accepted as being preferable. Some of the most important of these are:
1. The simplest type of graph consistent with its purpose is the most effective. No more lines or symbols should be used in a single graph than the eye can easily follow.
2. Every graph should be completely self-explanatory. Therefore, it should be correctly labeled as to title, source, scales, and explanatory keys or legends.
3. The position of the title for a graph is one of personal choice. In published graphs, however, the title is commonly placed below the graph.
4. When more than one variable is shown on a graph, each should be clearly differentiated by means of legends or keys.
5. The diagram or graph generally proceeds from left to right and from bottom to top. All writing should be placed, therefore, so as to read from the bottom or from the right-hand side of the page.
6. No more coordinate lines should be shown than are necessary to guide the eye.
7. Scale lines should be drawn heavier than other coordinate lines.
8. The lines of the graph itself should be heavier than either coordinate or scale lines.
9. Frequency is generally represented on the vertical scale, with method of classification on the horizontal.
The simplest type of graph is the bar diagram. It is especially useful for characterizing frequency distributions of nominal variables, ordinal variables, and quantitative variables of discrete type. Figure 2-2 is a simple bar graph based on the total data for 177 women as listed in Table 2-2. The nominal variable, microorganism type, is shown divided into its six classes: E. coli, Klebsiella, etc. In this diagram the bars representing each type are drawn of equal width and with length proportional to their frequency. Therefore, bar area is proportional to length. Comparison of the length of the bars gives a visual picture of the frequency of occurrence of the different types; it shows, for example, that there were about thrice as many cases of E. coli as Klebsiella.
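A rough text rendering of such a bar diagram, using the total counts of Table 2-2, is sketched below in Python. The scale of one mark per three cases is an arbitrary choice made for this illustration, and classes with a single case round to an empty bar at that scale.

```python
# Total counts for the 177 women in Table 2-2.
totals = {"E. coli": 120, "Klebsiella": 39, "Staphylococcus": 9,
          "Proteus": 7, "Yeast": 1, "Other": 1}


def bar(frequency, per_mark=3):
    """Return a bar that starts at zero, one mark per `per_mark` cases."""
    return "#" * round(frequency / per_mark)


# Bars of equal width (one text row each), length proportional to frequency.
for organism, frequency in totals.items():
    print(f"{organism:<15}|{bar(frequency)} {frequency}")

ratio = totals["E. coli"] / totals["Klebsiella"]
print(f"E. coli : Klebsiella = {ratio:.1f}")
```

Every bar is measured from zero, so the ratio of bar lengths (40 marks to 13) reflects the ratio of frequencies, about three to one.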
Bars may be drawn either horizontally or vertically. Regardless of the direction of the bars the scale line must start at zero or a wrong impression will result. Usually the diagram will be more attractive if the bars are wider than the spaces between them. Preferably the scale lines should be independent of the bars. Subclassification of the data may be shown by the use of multiple bars, in which case the diagram needs an appropriate legend.
When one is comparing two or more proportional distributions of either qualitative or quantitative data, in which the number of classes is relatively small, a form of graph known as proportional bar diagram is especially useful. Data from Table 2-2 are shown in this type of diagram in Figure 2-3. In this diagram a single bar, 100% in length, represents each distribution. Each bar is then divided into sections that correspond in length to the relative frequencies of the classes. That is, there are 77 E. coli cases among the 123 controls so the length of the section representing these cases is 62.6% of the control bar. It is easy to compare the lengths of the two dark solid sections and see that for the nuns proportionately more of the cases were E. coli than for the controls.
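The division of a fixed-length bar into sections can be sketched in Python as follows. As a simplifying assumption made here, the four rarer organisms of Table 2-2 are pooled into a single remainder class, and each bar is drawn 50 characters long.

```python
# E. coli, Klebsiella, and all remaining organisms pooled, from Table 2-2.
groups = {"Controls": [77, 34, 12], "Nuns": [43, 5, 6]}
MARKS = "#=."          # one fill character per class, in the order above
WIDTH = 50             # every bar is 100% long, drawn as 50 characters


def proportional_bar(frequencies, width=WIDTH):
    """Divide one fixed-length bar into sections by relative frequency."""
    total = sum(frequencies)
    bar, cumulative = "", 0
    for mark, f in zip(MARKS, frequencies):
        cumulative += f
        # Each section ends at the cumulative share, so lengths sum to width.
        end = round(width * cumulative / total)
        bar += mark * (end - len(bar))
    return bar


for name, frequencies in groups.items():
    print(f"{name:<9}|{proportional_bar(frequencies)}|")
```

Both bars have the same total length, so the longer run of `#` in the nuns' bar shows directly that E. coli makes up a larger share of their cases.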
[Figure 2-3. Proportional bar diagram: controls and nuns.]

The histogram is a diagram used exclusively for showing frequency distributions of quantitative data that are continuous in nature. It is essentially an area diagram composed of adjacent rectangles. Hence, the areas used to represent the class frequencies when added together will give the composite area for the entire distribution. Figure 2-4 shows a histogram for the age distribution of the 2,698 control women which was given in Table 2-3. The 10-year age intervals are represented by rectangles of equal width with heights proportional to the numbers of observations falling in the intervals.

[Figure 2-4. Histogram for the age distribution of 2,698 control women. Horizontal axis: age; vertical axis: frequency, 0 to 800.]

[Figure 2-5. Illustrations of various types of frequency distributions. Panel labels include heights for adults and family income.]
Thus the area above an age interval indicates the frequency of occurrence of ages in that interval. If the age intervals had not been equal the number of observations would have been equally distributed over the number of years for each interval in the histogram.
The histogram is a presentation of the sample frequency distribution. As sample size increases and class intervals are shortened, the sample approaches the entire universe of values and the outline of the histogram becomes smooth. Distributions of various shapes are found. Many of these are characterized by the fact that they are symmetrical, have only one peak, and build up gradually from a fairly low number at the two extremes of the scale to a maximum in the middle. One distribution of this type is known as the normal distribution (Fig. 2-5A).
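The remark above about unequal age intervals, where observations are spread evenly over the years of each interval so that area rather than height represents frequency, can be sketched in Python. The class limits and counts below are hypothetical and are not the data of Table 2-3.

```python
# Hypothetical age classes (lower limit, upper limit, frequency);
# the last class is twice as wide as the others.
classes = [(15, 25, 600), (25, 35, 740), (35, 45, 600), (45, 65, 700)]


def heights_per_year(classes):
    """Return frequency per year of age for each class."""
    return [freq / (upper - lower) for lower, upper, freq in classes]


heights = heights_per_year(classes)

# Area (height x width) recovers each class frequency.
areas = [h * (upper - lower) for h, (lower, upper, _) in zip(heights, classes)]

for (lower, upper, freq), h in zip(classes, heights):
    print(f"{lower}-{upper - 1}: {freq} observations, height {h:.1f} per year")
```

Drawn this way, the wide 45-64 class is plotted at a height of 35 per year rather than 700, so it does not appear to dominate the distribution merely because it spans more years.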
Some distributions are asymmetrical or skewed. They may be skewed to the right or to the left, depending on whether the long tail of the distribution is on the right or left side. Two examples of this type are illustrated in Figures 2-5B and D. Both of these are skewed to the right. Figure 2-5B has two peaks (bimodal) while Figure 2-5D is a more one-peak (or unimodal) type of distribution. Figure 2-5C shows still another distribution that has two peaks, one in early infancy and one in old age.
Applications of statistical techniques to biological data often assume, explicitly or implicitly, a particular mathematical formula for the distribution of the universe of values being sampled. This formula gives the relative frequency, or density, for a given measurement or scale class. Only three such distributions will be treated in this book: the normal, the binomial, and the Poisson. Data measured on a continuous scale are often compatible with the assumption that they were drawn from a normal distribution with its familiar bell-shaped frequency curve. This distribution is very briefly described in Chapter 3. Data measured on a discrete scale in which the observations take on only the values 0, 1, 2, ... often follow the binomial or Poisson distributions (discussed in Chapter 7).

PROBLEM 2-1
For a group of 105 individuals the blood plasma potassium (expressed in milliequivalents per liter) ranged from 2.46 to 4.32. Assuming that you wish to classify these data into equal class intervals, set up the class-interval limits you would use. (See List of Answers to Problems, at end of book.)
PROBLEM 2-2
Assume that you are making a study of the relationship between the sodium in the red blood cell and in the blood plasma. You wish at the same time to study the relationship between these factors and age, race, and sex. Set up a code instruction sheet for transfer of data to 80-column punch cards.

PROBLEM 2-3
The following paragraph is from J.A.M.A. 153: 1505-1508, 1953.
Gastric resection was performed in 474 patients with benign gastric ulcer, and in 7 of these all the stomach was removed. Gastric resection was performed on 60 patients with malignant ulcers, and 9 total gastrectomies were performed in this group. In 23 patients with benign ulcers and 21 patients with malignant ulcers, other surgical procedures, including exploratory laparotomy, repair of perforation, gastroenterostomy, and ligation of a bleeding artery were used. There were 415 patients with benign gastric ulcers and 7 patients with malignant gastric ulcers who had medical treatment only.
1. Arrange these data in tabular form, with totals, title, and source reference.
2. Write a code instruction for data transfer to punch cards.

PROBLEM 2-4
In a serious epidemic of poliomyelitis in England and Wales in 1947, it was shown that the case fatality varied with the extent of paralysis. For persons with paralysis of limbs and/or trunk the case fatality was 5.8%; for those with paralysis of other parts of the body it was 32.8%; for those with no paralysis it was 2.5%. Show these data in the form of a bar diagram.

PROBLEM 2-5
The following table shows the deaths from cancer of the breast and cancer of the lung in the United States for the years 1939 to 1948.
Because of a change in classification by the National Office of Vital Statistics, data are not available for deaths from cancer of the lung after 1948. Deaths from cancer of the lung are grouped with deaths from cancer of the bronchus and trachea in the data for 1949 and all succeeding years. There were 20,909 deaths recorded in 1954 as due to cancer of the breast; 24,788 deaths recorded as due to cancer of the lung, bronchus, and trachea. Deaths from the latter two entities are relatively small in relation to those from cancer of the lung.
1. Compare graphically the change in the number of deaths from cancer of the breast with that of cancer of the lung by:
   a. Line diagrams on arithmetic paper.
   b. Line diagrams on arithmetic paper showing the deaths in each year expressed as a percentage of the deaths in 1939.
   c. Line diagrams on arithlog paper.
Deaths in the United States from Cancer of the Breast and Cancer of the Lung, 1939-1948

          Cancer of breast             Cancer of lung
Year      No. of     % of deaths      No. of     % of deaths
          deaths     in 1939          deaths     in 1939
1939      14,868     100.0             5,120     100.0
1940      15,488     104.2             5,430     106.1
1941      15,526     104.4             6,025     117.7
1942      15,945     107.3             6,329     123.6
1943      16,140     108.6             7,088     138.4
1944      16,379     110.2             7,621     148.8
1945      17,133     115.2             8,162     159.4
1946      17,516     117.8             8,864     173.1
1947      18,030     121.3             9,571     186.9
1948      19,162     128.9            10,493     204.9

Source: Vital Statistics of the United States, 1939-1948.

2. Discuss the change in mortality from cancer of these two sites as shown by these three graphs.
3. Under what conditions would you prefer to use each of these types of graphs?
The Normal Distribution
The normal or Gaussian distribution has a symmetric frequency curve shaped like a bell. It is completely specified by two parameters called the mean and standard deviation, usually denoted $\mu$ and $\sigma$. The mean, $\mu$, indicates the center of the distribution, and the standard deviation, $\sigma$, indicates its spread or variability. The formula for the relative frequency curve is

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where $\pi$ and $e$ are known mathematical constants. The probability of an observation, $x$, falling in the interval $x_1$ to $x_2$ is given as the area under the curve $f(x)$ between $x_1$ and $x_2$. These areas (probabilities) have been extensively tabled (see Table A).*

* Table A appears on page 207. Similarly, keyed tables (through E) follow.

As an example of biologic data approximating the normal pattern, consider the distribution of 1,060 Tulane University School of Medicine freshman medical students' pulse beats counted for 60 seconds at the end of an hour's lecture (Table 3-1). Examination of this distribution shows that it is reasonably symmetrical with the frequency low at both ends of the distribution and a maximum in the pulse beat group of 75 to 78.
The frequencies increase in each class interval until the maximum is reached in the range from 75 to 78. Although these measurements are discrete measurements, their pattern is characteristic of the normal distribution and is roughly approximated by it.

For the distribution of the sample of 1,060 students according to pulse beats the mean is 76.4 beats per minute and the standard deviation (S.D.) is 9.7 beats per minute. When these values are placed in the above equation, and the values of x corresponding to the midpoint of each class interval substituted, the corresponding values of f(x) can be determined. The curve, nf(x), has been superimposed on the histogram of the distribution (Fig. 3-1).

Table 3-1. Distribution of 1,060 Students by Resting Pulse Beat

Pulse beat               Frequency             Accumulated relative     Expected midpoint
per minute   Midpoint    Observed   Expected   frequency                ordinate (nf(x))
                                               Observed    Expected
43-46          44.5          1         0.5     0.0009      0.0005          0.8
47-50          48.5          2         1.6     0.0028      0.0020          2.8
51-54          52.5          6         5.1     0.0085      0.0069          8.4
55-58          56.5         22        14.0     0.0292      0.0201         21.3
59-62          60.5         52        32.3     0.0783      0.0506         45.5
63-66          64.5         79        62.9     0.1528      0.1099         82.2
67-70          68.5        118       103.6     0.2642      0.2077        125.2
71-74          72.5        165       144.3     0.4198      0.3439        160.9
75-78          76.5        186       170.0     0.5953      0.5042        174.4
79-82          80.5        165       169.2     0.7509      0.6638        159.5
83-86          84.5        103       142.5     0.8481      0.7983        123.0
87-90          88.5         82       101.4     0.9255      0.8940         80.1
91-94          92.5         45        61.1     0.9679      0.9516         43.9
95-98          96.5         19        31.1     0.9858      0.9809         20.3
99-102        100.5         11        13.4     0.9962      0.9935          7.9
103-106       104.5          3         4.9     0.9991      0.9981          2.6
107-110       108.5          1         1.5     1.0000      0.9995          0.7
110+                         0         0.6     1.0000      1.0000
Total                    1,060

Mean = 76.4. S.D. = 9.7.

[Figure 3-1. Histogram of the observed distribution in Table 3-1 with superimposed normal distribution with mean 76.4 and S.D. 9.7.]
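The substitution of class midpoints into the normal formula can be sketched in a few lines. This is only an illustration: the function name `normal_density` is hypothetical, and the factor of 4 (the class width in beats) is an assumption inferred from the tabulated ordinates, which match n f(x) multiplied by the width of a class.

```python
import math

def normal_density(x, mean, sd):
    """Relative frequency curve f(x) of the normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / math.sqrt(2 * math.pi * sd ** 2)

# Values quoted in the text for the 1,060 students.
n, mean, sd = 1060, 76.4, 9.7
# Assumption: each pulse class (e.g. 75-78) spans 4 beats.
class_width = 4

# Class midpoints 44.5, 48.5, ..., 108.5 as listed in Table 3-1.
midpoints = [44.5 + 4 * i for i in range(17)]
ordinates = {m: n * class_width * normal_density(m, mean, sd) for m in midpoints}

for m in midpoints:
    print(f"{m:6.1f}  {ordinates[m]:6.1f}")
```

Under that assumption the printed values agree with the last column of Table 3-1 to the one decimal shown there.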
This histogram of the actual distribution shows, on the whole, only fairly close agreement with the normal distribution, calculated for the sample of 1,060 individuals, with a mean pulse beat of 76.4 and a standard deviation of 9.7 beats. For a test of normality see Chapter 10 and Problem 10.1. The distribution as a whole has also been divided by erecting perpendicular lines or ordinates to the x axis at the mean of the distribution and at intervals of 1, 2, and 3 S.D.s on either side of the mean. These lines divide the area under both the curve and the histogram into six parts, obviously unequal in area. The area between any two ordinates, however, is approximately the same under the normal curve as under the histogram. The area under the curve, therefore, could be used in place of the area under the histogram and from this the number of observations falling between any two ordinates could be estimated.

It should be recognized that the normal curve drawn for a particular distribution will depend on the value of the mean, the standard deviation, and the number of observations in the distribution. Thus there will be an infinite number of normal curves. Methods of estimating the area of a rectangle or a circle are easy but the methods of computing the area under the normal curve between any two ordinates are not so simple. For that reason, areas for the standard normal curve with mean 0 and standard deviation of 1 have been calculated and tabled. In order that these areas may be applied to any specific curve, the x scale of the curve has been transformed into what is often referred to as a standard measure or relative deviate scale, $x'$. This means that any given value is expressed as a number of standard deviations from the mean. For example, in a curve with $\mu = 10$ and $\sigma = 2$ the value $x = 8$ would be represented as one standard deviation from the mean since $x' = (x - \mu)/\sigma$
... skews to the right, and it is ... 0.5.

... the distribution depends on whether we consider the rate scale, a/n, or the "success" scale, a:

Mean of rates (a/n) = p, the underlying probability
Mean of successes (a) = np

In Table 7-1, the variable x is set equal to a/n or a/6, showing the computed mean rate for each p. The standard deviation of a single observation that follows the binomial distribution with parameter p is $\sqrt{p(1-p)} = \sqrt{pq}$. From the rule of standard errors of the mean for n observations we obtain:

Standard error of rate (a/n) = $\sqrt{pq/n}$
Standard error of success (a) = $\sqrt{npq}$

Table 7-1 gives also the results of the summation $\sum (x - \bar{x})^2 f(x)$, which is the variance of the rate distribution. It is easily seen that for each p, the variance is pq/n. Similarly, the table shows that the variance of "successes" is npq. That the distributions for p = 0.1 and for p = 0.9 have the same variance or standard error squared follows naturally from the identity $np(1-p) = n(1-p)p$.

In summary, we find that the distribution of rates or proportions is defined by one parameter, the underlying probability, p, when the sample size, n, is given. The normal distribution needs two independent parameters, the mean and the standard deviation, for specification. Knowledge of the underlying distribution enables us to estimate the probability that an observed rate, a/n, is a sample of a universal distribution of a given or assumed p. We compute the probability that a or fewer events would occur, assuming the probability of a single occurrence to be p.

Two methods are used: the normal approximation and direct binomial expansion. The first is used for larger samples, the other for small samples with a few or no events.
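The statements about the mean and variance of the rate a/n can be checked numerically. The sketch below assumes samples of n = 6, as in the a/6 scale mentioned for Table 7-1; the helper names `binomial_pmf` and `rate_moments` are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_pmf(a, n, p):
    """Probability of exactly a events in n trials."""
    return comb(n, a) * p ** a * (1 - p) ** (n - a)

def rate_moments(n, p):
    """Mean and variance of the rate a/n, found by direct summation."""
    freqs = [binomial_pmf(a, n, p) for a in range(n + 1)]
    mean = sum((a / n) * f for a, f in enumerate(freqs))
    var = sum((a / n - mean) ** 2 * f for a, f in enumerate(freqs))
    return mean, var

n = 6  # assumption: the sample size behind the a/6 scale
for p in (0.1, 0.5, 0.9):
    mean, var = rate_moments(n, p)
    print(f"p={p}: mean rate={mean:.4f}, variance={var:.5f}, pq/n={p * (1 - p) / n:.5f}")
```

The summed variance agrees with pq/n for each p, and the values for p = 0.1 and p = 0.9 coincide, as the identity in the text requires.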
NORMAL APPROXIMATION

Larger samples of the binomial distribution follow the central limit theorem, which states that the distribution of a approximates a normal distribution, with mean np and standard deviation $\sqrt{npq}$. This approximation is adequate with rather small n's if p is 0.5. Larger n's are necessary the more different p is from 0.5. Figures 7-2 and 7-3 compare the binomial and normal distribution for p = 0.5 and n = 10 and n = 40, respectively. In Figure 7-2, the normal distribution has mean 5 (np = 10 × 0.5) and standard deviation $\sqrt{2.5}$ (npq = 10 × 0.5 × 0.5); in Figure 7-3 the mean of the normal distribution is 20 and the standard deviation is $\sqrt{10}$. In general, the approximation to the normal distribution is sufficiently close if both np and nq are larger than 5.

[Figure 7-2. Comparison of normal curve and "binomial curve" for $(1/2 + 1/2)^{10}$.]
In estimating the probability that an observed proportion a/n is different from a theoretical or assumed p we use the normal variate $x'$, which is equivalent to t with infinite degrees of freedom (marked $\infty$). $x'$ is formed using the number of successes, a, and its mean and standard deviation based on p. Or, equivalently, a/n and its mean and standard deviation can be used.

$$x' = t_\infty = \frac{a - np}{\sqrt{npq}} = \frac{a/n - p}{\sqrt{pq/n}}$$

Example. In the year 1953, 1,523 deaths in the United States were attributed to rheumatic fever. There were 774 males and 749 females, a proportion of 774/1,523 = 0.5082 males. Assuming that p (the proportion of males in the universe) is 0.5, one can ask if the observed number of males is unlikely. It follows that:

Mean = np = 1,523 × 0.5 = 761.5
Standard error = $\sqrt{npq} = \sqrt{1{,}523 \times 0.5 \times 0.5} = 19.5$
$t_\infty = (774 - 761.5)/19.5 = 0.64$

Since the observed t does not exceed 1.96, the 0.05 value (see bottom line of Table B), there is no evidence to reject the null hypothesis, which in this case amounts to stating that there is no sex preference in deaths caused by rheumatic fever.
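The arithmetic of the rheumatic fever example can be retraced with a short sketch. The function name `normal_deviate` is hypothetical; the figures 774, 1,523 and p = 0.5 are those quoted in the example.

```python
import math

def normal_deviate(a, n, p):
    """Standard measure x' = (a - np) / sqrt(npq) for a successes in n trials."""
    q = 1 - p
    return (a - n * p) / math.sqrt(n * p * q)

males, deaths, p_assumed = 774, 1523, 0.5
mean = deaths * p_assumed                                        # np
standard_error = math.sqrt(deaths * p_assumed * (1 - p_assumed)) # sqrt(npq)
x_prime = normal_deviate(males, deaths, p_assumed)

print(f"mean = {mean}, standard error = {standard_error:.1f}, x' = {x_prime:.2f}")
# The deviate is compared with 1.96, the 0.05 value used in the text.
print("exceeds 1.96:", abs(x_prime) > 1.96)
```

The same deviate results if the rate scale is used instead, since numerator and denominator are both divided by n.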
When p is not known but is to be estimated by a/n, 95% confidence limits for p are given by the formula

$$\frac{a}{n} \pm 2\sqrt{\frac{(a/n)(1 - a/n)}{n}}$$
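A minimal sketch of approximate limits of this kind follows. It assumes limits taken as two standard errors on either side of a/n, the multiplier used in the discussion of a/n = 0.02 with sample size 150; the function name `approximate_limits` is hypothetical.

```python
import math

def approximate_limits(a, n, multiplier=2.0):
    """Approximate limits for p: a/n plus or minus `multiplier` standard errors."""
    rate = a / n
    se = math.sqrt(rate * (1 - rate) / n)
    return rate - multiplier * se, rate + multiplier * se, se

# Rheumatic fever example: 774 males among 1,523 deaths.
low_rf, high_rf, se_rf = approximate_limits(774, 1523)
print(f"774/1523: {low_rf:.4f} to {high_rf:.4f}")

# A small rate: 3 events in 150 gives a negative lower limit.
low, high, se = approximate_limits(3, 150)
print(f"3/150: standard error {se:.4f}, limits {low:.4f} to {high:.4f}")
```

The second call shows why the approximation is unsuitable for rates far from 0.5: the lower limit falls below zero.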
BINOMIAL EXPANSION

When p is very different from 0.5 and its distribution is heavily skewed, an approach through the normal distribution is inaccurate. For example, with a/n = 0.02 and sample size 150, the standard error is $\sqrt{0.02 \times 0.98/150} = 0.0114$. However, subtraction of 2 standard errors from a/n or 0.02 - 0.0228 = -0.0028 leads to a negative value for the lower confidence limit of p, which is absurd.

[Figure 7-3. Comparison of normal curve and "binomial curve" for $(1/2 + 1/2)^{40}$.]

In such circumstances, the frequency of each event and confidence intervals can be found in tables.* Alternatively the formulae of this chapter can be used for direct calculation. Making these calculations is currently eased considerably by use of electronic computers, but often a desk calculator with sufficient digits and multiplication and division facilities can yield the result in shorter time than it takes to obtain "time" on a large electronic computer.
Returning to the question of the confidence limit of p when a/n = 3/150, one can first approach the problem by assuming the observed rate is the "true" probability and proceeding to find the frequency of 0, 1, 2, etc., occurrences out of 150. By adding each frequency in sequence the accumulated frequency distribution results, which tells us which number of events are outside the 95% limits of the distribution. Thus, with p = 0.02 and n = 150 we find:

Events (a)    Frequency f(a)                                   Accumulated frequency
0             f(0) = q^150 = (0.98)^150     = 0.048296         0.048296
1             f(1) = f(0) 150p/q            = 0.147845         0.196141
2             f(2) = f(1) 149p/2q           = 0.224785         0.420926
3             f(3) = f(2) 148p/3q           = 0.226314         0.647240
. . .
6             f(6) = f(5) 145p/6q           = 0.049886         0.968009
7             f(7) = f(6) 144p/7q           = 0.020943         0.988952

* See, for example, the Handbook of Tables for Probability and Statistics, edited by William H. Beyer, Chemical Rubber Co., Cleveland, 1966.

Thus one can determine that events between 0 and 6 occur with 95% probability if p = 0.02. Repeating the calculation for other assumed "true" p's will lead to a confidence interval for p based on the observed a = 3.

The direct computation of probability is simplest when the observed rate is either 0/n or n/n. The probability is then $q^n$ or $p^n$, respectively.
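The step-by-step build-up of the frequencies can be sketched as below. The recurrence f(a+1) = f(a)(n - a)p/((a + 1)q) is the pattern visible in the rows of the table (150p/q, 149p/2q, 148p/3q, and so on); the function name `binomial_table` is hypothetical.

```python
def binomial_table(n, p, max_events):
    """Frequencies f(a) and accumulated frequencies for a = 0 .. max_events."""
    q = 1 - p
    f = q ** n          # f(0) = q^n
    accumulated = 0.0
    rows = []
    for a in range(max_events + 1):
        accumulated += f
        rows.append((a, f, accumulated))
        # Recurrence: f(a+1) = f(a) * (n - a) * p / ((a + 1) * q)
        f *= (n - a) * p / ((a + 1) * q)
    return rows

rows = binomial_table(150, 0.02, 7)
for a, f, acc in rows:
    print(f"{a}  {f:.6f}  {acc:.6f}")
```

Only multiplication and division are needed after the first term, which is why the text notes that a desk calculator suffices.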
In the example given in Table 6-2, there were seven patients in whom improvement in breathing was attempted. Six patients did show improvement and one had the same measurement before and after treatment. Let the null hypothesis be that the probability of one patient's condition staying arrested or being improved is 0.5 and the probability of his worsening is 0.5. Then the probability of seven patients' not worsening is $(0.5)^7 = 1/128 = 0.0078$. The null hypothesis can then be rejected at the 0.01 level.

Accepting a treatment effect one can then ask "What is the least probability of improvement that contains 7/7 within its 95% limits?" This is expressed as follows:

$$p^7 \ge 0.05$$

for which the solution is

$$p \ge \sqrt[7]{0.05} = \operatorname{antilog}\left(\frac{\log 0.05}{7}\right) = \operatorname{antilog}(-0.18586)$$

$$p \ge 0.65$$

The answer is that there is evidence that the patient's chance of improvement is at least 0.65, or that his "odds" in favor of improvement are 2/1 or better.

Chapter 16 will contain examples of utilization of the binomial distribution in estimating the sample size, n, for given p.
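The two calculations for the seven patients can be reproduced directly. This is a small illustrative sketch; the variable names are hypothetical and the 0.05 level is the one used in the text.

```python
import math

n_patients = 7

# Probability that none of seven patients worsens when p = 0.5 for each.
prob_all_not_worse = 0.5 ** n_patients
print(f"(0.5)^7 = {prob_all_not_worse:.4f}")

# Least p for which 7 out of 7 still has probability 0.05: solve p^7 = 0.05.
log_term = math.log10(0.05) / n_patients   # the -0.18586 of the text
least_p = 10 ** log_term                   # "antilog" in the text's wording
print(f"log term = {log_term:.5f}, least p = {least_p:.2f}")
```

The logarithmic route mirrors the hand calculation; `0.05 ** (1 / 7)` gives the same value in one step.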
THE POISSON DISTRIBUTION OR LAW OF SMALL NUMBERS

A frequently encountered circumstance is that the probability of an event is very small but that even if the sample size is large only a small number of events are observed. For example, when one is counting blood cells, millions of cells are drawn and diluted so that only a few cells out of the millions are likely to be counted in the microscope grid. Counts of cells per square in subdivisions of the grid follow a certain distribution called the Poisson distribution, so named after the French mathematician who first described it.

The scale is an integer scale 0, 1, 2 . . . that theoretically ends with n, the large sample size, but in practice only low frequencies are observed. The mean of the distribution is a noninteger which is usually called m and is identical to that of the binomial:

$$m = \frac{\sum a\, f(a)}{\sum f(a)} = np.$$

If p is 0.00001 and n = 200,000, the mean is 2.0. The standard error is also derived from that of the binomial distribution in a special case:

$$\sigma = \sqrt{np(1 - p)}.$$

Since p is very small, 1 - p is almost identical to 1 and the formula becomes

$$\sigma = \sqrt{np} = \sqrt{m}.$$

The standard error of the Poisson distribution equals the square root of its mean, or the mean and variance are equal.

The expected frequency can be derived from the recurrence formula for the binomial. We had

$$f(a + 1) = f(a)\,\frac{(n - a)\,p}{(a + 1)\,q}.$$

Since q is almost identical to 1 and a is very small in comparison to n, the formula reduces to

$$f(a + 1) = f(a)\,\frac{np}{a + 1} = f(a)\,\frac{m}{a + 1}.$$

For the first subset, 0, we had $f(0) = (1 - p)^n$. This reduces, approximately, for small p's to

$$f(0) = e^{-np} = e^{-m},$$

where e is the base of the natural logarithm. It follows, then, that the Poisson frequency distribution is

$$f(a) = e^{-m} m^a / (a!).$$

The data presented in Table 7-2 were gathered from the Annual Report from the State Department of Public Health, Pennsylvania, in 1960. Multiple sclerosis is a rare disease for which the annual mortality is about 4 per 100,000 population. The 34 Pennsylvania counties whose population is between 15,000 and 90,000 reported 0, 1, and 2 deaths from multiple sclerosis in 1960. The number of counties so reporting were 18, 13, and 3, respectively.

Table 7-2. Distribution of Medium-Size Pennsylvania Counties by Number of Deaths from Multiple Sclerosis (1960)

No. of deaths    No. of counties    Expected no.
0                18                 19.4
1                13                 10.9
2                 3                  3.0
3+                0                  0.7
Total            34                 34.0

m = 19/34 = 0.559.

The mean of the distribution is then estimated to be, per county,

$$m = (0 \times 18 + 1 \times 13 + 2 \times 3)/34 = 19/34 = 0.559.$$

We can then expect, if the Poisson distribution is valid, to find the following frequency:

0 deaths     f(0)  = 34 × e^(-0.559) = 19.4
1 death      f(1)  = 10.9
2 deaths     f(2)  = 3.0
and the remainder, 3 or more,
             f(3+) = 34.0 - (19.4 + 10.9 + 3.0) = 0.7

If no . . .
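The expected numbers in Table 7-2 can be retraced with the Poisson formula. The sketch below uses only the county counts quoted in the text; the helper name `poisson_frequency` is hypothetical.

```python
import math

def poisson_frequency(a, m):
    """Poisson relative frequency f(a) = e^(-m) m^a / a!."""
    return math.exp(-m) * m ** a / math.factorial(a)

# Number of counties reporting 0, 1 and 2 deaths (Table 7-2).
counties = {0: 18, 1: 13, 2: 3}
total = sum(counties.values())
m = sum(deaths * count for deaths, count in counties.items()) / total

expected = [total * poisson_frequency(a, m) for a in range(3)]
expected.append(total - sum(expected))   # remainder: 3 or more deaths

print(f"m = {m:.3f}")
for label, value in zip(["0", "1", "2", "3+"], expected):
    print(f"{label:>2} deaths: expected {value:.1f} counties")
```

The last class is obtained by subtraction so that the expected numbers add to the 34 counties observed.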
of Table 8-1, taken from the Journal of the American Medical Association, 145:14, 1951.

Table 8-1. Results of Medical Management in 101 Cases of Massive Gastric Hemorrhage

Period      No. cases    No. deaths    % dying
1930-43     40           10            25.0
1944-49     61            5             8.2

Source: J.A.M.A., 145:14, 1951.

The authors state, "The chief factor, we believe, that will explain the decrease in the mortality rate from 25.0 to 8.2% is the more liberal use of transfusion." The first question to be answered is: Is this difference between 25.0 and 8.2% of a magnitude that will occur frequently in samples coming from the same universe? Here the fatality rate of persons with gastric hemorrhage is not known, but if there is a common case-fatality rate, the best estimate of it is that based on the sum total of the data. In this case 15 or 14.8% of 101 persons with massive gastric hemorrhage died. Hence the estimate for the universe p is 0.148. The calculations necessary for the test of significance are therefore:

$a_1 = 10$, $n_1 = 40$, $a_2 = 5$, $n_2 = 61$, $A = 15$, $N = 101$

$$\chi^2 = \frac{(10 \times 61 - 40 \times 5 - 50.5)^2 \times 101}{40 \times 61 \times 15 \times 86} = \frac{13{,}053{,}265.25}{3{,}147{,}600} = 4.15$$
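The calculation above can be written as a small function. The general form used here, with N/2 subtracted from the absolute cross-product difference, is inferred from the numbers shown (50.5 = 101/2) and should be read as an assumption about the formula on the missing pages; the function name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(a1, n1, a2, n2):
    """Chi-square for two proportions a1/n1 and a2/n2, with N/2 subtracted
    from the cross-product difference as in the worked example."""
    A = a1 + a2          # total events
    N = n1 + n2          # total observations
    numerator = (abs(a1 * n2 - n1 * a2) - N / 2) ** 2 * N
    denominator = n1 * n2 * A * (N - A)
    return numerator / denominator

# Table 8-1: 10 deaths in 40 cases (1930-43), 5 deaths in 61 cases (1944-49).
chi2 = chi_square_2x2(10, 40, 5, 61)
print(f"chi-square = {chi2:.2f}")
```

With one degree of freedom, the result can be compared with the tabled 0.05 value of chi-square.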
(p
This analysis shows that
=
1)(M
1
-
-L+
)
1)
2)
-
JL-rXM
P(r) (r
+
1)(^
~L+
r)
(r
+
1
))
Example. A most dramatic test of the efficacy of hyperimmune rabies serum (gamma globulin) was performed in Iran (Bull. World Health Organ., 13:747-772, 1955). Seventeen persons had been bitten in the head or neck by the same rabid wolf. The standard Pasteur vaccine treatment was given. In addition, 12 persons received one or more doses of antirabies gamma globulin. The results are summarized in Table 8-4.

Table 8-4. Deaths from Rabies with and without Gamma Globulin Added to Pasteur Vaccine Treatment

                           Persons bitten
                           by rabid wolf     Dead    Alive
With gamma globulin        12                1       11
Pasteur treatment only      5                3        2

Source: Bull. World Health Organ., 13:747-772, 1955.

Obviously the numbers are very small and the $\chi^2$ test would be of no validity. With Fisher's exact test, the assumption is that only the occurrences that point to benefit of the new treatment (gamma globulin) are of interest. Are one or no deaths out of 12 an unlikely occurrence compared with 3 out of 5?

The smallest marginal total is 4 = L; M = 12 since 1/12 < 3/5, and D = 17 - 12 = 5.

$$P(0) = \frac{2 \times 3 \times 4 \times 5}{14 \times 15 \times 16 \times 17} = \frac{1}{476}$$

$$P(1) = \frac{1}{476} \times \frac{4 \times 12}{1 \times 2} = \frac{24}{476}$$

$$P(0) + P(1) = \frac{25}{476} = 0.053$$
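The two probabilities can be checked by counting arrangements directly. The sketch below uses the hypergeometric form of the exact test rather than the step-by-step products of the text; this is a swapped-in but equivalent route, and the function name `prob_deaths_in_treated` is hypothetical.

```python
from math import comb

def prob_deaths_in_treated(r, deaths, treated, total):
    """Probability that exactly r of the deaths fall in the treated group
    when all marginal totals are held fixed."""
    untreated = total - treated
    return comb(treated, r) * comb(untreated, deaths - r) / comb(total, deaths)

total, treated, deaths = 17, 12, 4   # N, M and L of the example

p0 = prob_deaths_in_treated(0, deaths, treated, total)
p1 = prob_deaths_in_treated(1, deaths, treated, total)
print(f"P(0) = {p0:.5f}  (1/476 = {1 / 476:.5f})")
print(f"P(1) = {p1:.5f}  (24/476 = {24 / 476:.5f})")
print(f"P(0) + P(1) = {p0 + p1:.3f}")
```

Only the outcomes at least as favorable to gamma globulin as the one observed (0 or 1 death among the 12) are summed, in line with the one-sided question posed in the text.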
[Figure 9-6. Relationship between total circulating protein and surface area, with regression line, y = 40.604 + 72.169x. Horizontal axis: surface area in square meters, 1.40 to 2.40.]
... substituted for x and the corresponding value of y is obtained. Then a value of x such as 2.4 is substituted and a second value of y is obtained. These two paired values for x and y, (1.6, 156.1) and (2.4, 213.8), are plotted on the diagram and the line connecting them is now the regression line.

What do the values of a and b indicate? The value of b indicates that for each square meter change in surface area of the body, the circulating protein increases (because b is positive) by 72.169 gm. The a value is the value of y when x is 0, that is, the intercept of the regression line and the y axis. This regression line, fitted to the 62 points of the scatter diagram, is pictured in Figure 9-6. There is considerable scatter of the dots about the line.

A measure of the variation of these points around the line can be determined. It is known as the standard error of estimate and usually is denoted by $s_{y.x}$. The standard error of estimate measures the agreement between the y values observed and those predicted by the regression equation from the observed values of x. If, as is usual, $y_i$ is an observation on the dependent variable, $x_i$ the associated observation on the independent variable, and $Y_i = a + bx_i$ is the value of the dependent variable predicted for $x_i$ by use of the regression equation, then the definition of $s_{y.x}$ is:

$$s_{y.x} = \sqrt{\frac{\sum [y_i - (a + bx_i)]^2}{n - 2}}$$

and the associated degrees of freedom are n - 2. The term under the square root sign is also known as the residual variance. Computational formulae are:

$$s_{y.x} = \sqrt{\frac{\sum y^2 - (\sum y)^2/n - \left[\sum xy - (\sum x)(\sum y)/n\right]^2 / \left[\sum x^2 - (\sum x)^2/n\right]}{n - 2}} = s_y \sqrt{[1 - r^2][(n - 1)/(n - 2)]}.$$
The sample statistics a, b, like the sample mean, follow normal distributions when the original sample of y's is from a normal distribution. The standard error of b is given by the formula:

$$s_b = \frac{s_{y.x}}{\sqrt{\sum x^2 - (\sum x)^2/n}},$$

and this provides a simple test of the null hypothesis that the true slope, estimated by b, is zero. The quantity $t = b/s_b$ is compared to the cut-off points of the t distribution (Table B, p. 208). If the calculated value of t lies beyond either cut-off point, b is declared to be significantly different (statistically) from zero at the chosen probability level.

Other standard statistical tests based on the standard error of estimate are available. For instance, confidence limits can be computed for the y value predicted from an x value by the regression equation, and two or more regression lines can be tested for parallelism. For these additional tests the reader is referred to an intermediate statistics text such as the Snedecor and Cochran volume referred to earlier.
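The computational formulae for the standard error of estimate and the slope test can be sketched from summary sums alone. The function name `slope_test` is hypothetical; the sums passed to it are those quoted in this chapter for the circulating protein example (n = 62).

```python
import math

def slope_test(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Return a, b, s_yx, s_b and t = b/s_b from the raw sums."""
    sxx = sum_xx - sum_x ** 2 / n          # corrected sum of squares of x
    syy = sum_yy - sum_y ** 2 / n          # corrected sum of squares of y
    sxy = sum_xy - sum_x * sum_y / n       # corrected sum of products
    b = sxy / sxx
    a = sum_y / n - b * sum_x / n
    s_yx = math.sqrt((syy - sxy ** 2 / sxx) / (n - 2))   # standard error of estimate
    s_b = s_yx / math.sqrt(sxx)                          # standard error of the slope
    return a, b, s_yx, s_b, b / s_b

a, b, s_yx, s_b, t = slope_test(
    n=62, sum_x=117.76, sum_y=11016,
    sum_xx=225.4290, sum_yy=2000748, sum_xy=21050.38,
)
print(f"a = {a:.3f}, b = {b:.3f}, s_yx = {s_yx:.3f}, s_b = {s_b:.3f}, t = {t:.1f}")
```

The calculated t is then compared with the tabled cut-off point for n - 2 = 60 degrees of freedom.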
For the example on circulating protein, y, and body surface area, x (Table 9-2), the standard error of estimate is

$$s_{y.x} = \sqrt{\frac{2{,}000{,}748 - (11{,}016)^2/62 - \left[21{,}050.38 - (117.76)(11{,}016)/62\right]^2 / \left[225.4290 - (117.76)^2/62\right]}{60}} = 23.903$$

with 60 degrees of freedom and the standard error of the slope is

$$s_b = \frac{23.903}{\sqrt{225.4290 - (117.76)^2/62}} = 18.013.$$

Thus, to test the observed slope against zero, we calculate

$$t = b/s_b = 72.1686/18.013 = 4.0.$$

The 5% cut-off points of the t distribution with 60 d.f. are t(0.05) = 2.000 and -t(0.05) = -2.000. Since the calculated value of t exceeds the upper 5% cut-off point, it is concluded that there is significant evidence that the slope of the true regression line is not zero (i.e., not horizontal) and a trend exists. (Notice that if b had been of the same magnitude but negative, it would have been also declared to be significantly different from zero because then the calculated t value would have fallen short of the lower cut-off point. Another way of stating the test is that for b to be significant at the chosen level the calculated value $t = b/s_b$ must exceed the tabulated cut-off point for t in absolute value, i.e., without regard to sign.)

Although the correlation coefficient, r, and the regression coefficient, b, are different statistics, the arithmetic values of significance tests of each are always identical. Note that the t value obtained for r in the previous example was t = 4.0, as was the above t value for b.

Complete data for calculation of a regression line involving only 10 points are given in Table 9-3. The two variables involved are age and the 1965 United States death rate per 1,000 population. The "J-shaped" curve of death rates plotted against age is complex. However, it has been observed that for age 40 and over there is a regular increase which can be well expressed by a linear regression of logarithmic death rate on age. The independent variable, x, is age expressed in five-year increments from 42.5; the dependent variable is the logarithm (base 10) of the death rate. The scatter diagram is shown in Figure 9-7 with the fitted regression line drawn in. The necessary sums are given on Table 9-3 and the computational formulae for a and b yield the estimates shown on the table and the equation

log death rate = 0.571 + 0.186x, where x = (age - 42.5)/5.

Table 9-3. Death Rates (per 1,000 Population) by Age, United States, 1965

Age            Coded age (x)    Death rate    Log10 death rate (y)
40-44          0                  3.7         0.57
45-49          1                  5.8         0.76
50-54          2                  9.1         0.96
55-59          3                 13.9         1.14
60-64          4                 20.6         1.31
65-69          5                 31.7         1.50
70-74          6                 45.5         1.66
75-79          7                 68.1         1.83
80-84          8                106.8         2.03
85 and over    9                202.0         2.31

n = 10; Σx = 45; x̄ = 4.5; Σx² = 285; Σy = 14.07; ȳ = 1.407; Σy² = 22.6513; Σxy = 78.6400

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(78.64) - 45(14.07)}{10(285) - (45)^2} = 0.1858$$

$$a = \bar{y} - b\bar{x} = 1.407 - 0.1858(4.5) = 0.571$$

Source: Vital Statistics of the United States 1965, Vol. II, Mortality, Part A. U.S. Department of Health, Education and Welfare, Public Health Service (Washington, 1967).

The slope, b, represents the number of log units the death rate changes with a five-year increase in age so that b/5 = 0.03715 gives the annual change in log death rate, or, after converting to the death rate scale (with antilogs), the annual percentage change, c, in death rate,

$$c = 100[\operatorname{antilog}(b/5) - 1] = 100[1.089 - 1] = 8.9\% \text{ per year.}$$

As for the previous example, the regression line is drawn by calculating two points and connecting them. For example, if x = 0, y = 0.571, then the death rate = antilog y = 3.73, and if x = 3, y = 1.128 and the death rate = antilog (y) = 13.44. As a check, the point (x̄, ȳ) should also fall on the line drawn.

[Figure 9-7. Linear regression of logarithmic death rate (United States, 1965) on age. Vertical axis: death rate per 1,000.]
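The fit of the logarithmic death rate on coded age can be reproduced from the ten pairs of Table 9-3. This is a minimal sketch using the same computational formulae for a and b; the variable names are hypothetical.

```python
# Coded age x (0 = ages 40-44, ..., 9 = 85 and over) and log10 death rate y
# as tabulated in Table 9-3.
coded_age = list(range(10))
log_rate = [0.57, 0.76, 0.96, 1.14, 1.31, 1.50, 1.66, 1.83, 2.03, 2.31]

n = len(coded_age)
sum_x = sum(coded_age)
sum_y = sum(log_rate)
sum_xx = sum(x * x for x in coded_age)
sum_xy = sum(x * y for x, y in zip(coded_age, log_rate))

b = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
a = sum_y / n - b * sum_x / n

# Annual percentage change: antilog of the slope per single year of age.
annual_pct_change = 100 * (10 ** (b / 5) - 1)

print(f"b = {b:.4f}, a = {a:.3f}, annual change = {annual_pct_change:.1f}% per year")
# Two points for drawing the line, converted back to the death rate scale.
for x in (0, 3):
    y = a + b * x
    print(f"x = {x}: y = {y:.3f}, death rate = {10 ** y:.2f}")
```

Dividing the slope by 5 before taking the antilog is what converts the five-year coded scale to a yearly rate of increase.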
It is possible to minimize the x deviations instead of the y deviations by interchanging the roles of the x and y variables. This means using the former dependent variable as independent and vice versa, and represents a fundamental change in the formulation of the problem. The resulting line, x = A + By, is not identical with y = a + bx (except in the case of perfect correlation) because different assumptions have been made and different deviations minimized. The equation x = A + By is called the regression of x on y.
PROBLEM 9-1
Murphy and Gardner (unpublished data, University of Pennsylvania) collected platelets from normal donors, labeled them with chromium-51, and injected them into patients who had no circulating platelets of their own because of acute leukemia or aplastic anemia. The yield of platelets circulating in the recipient (as percent of the injected amount) was determined by direct microscopic enumeration and isotope counting. The results of 15 double measurements are given below:

Enumeration (x)    Isotope count (y)
98                 86
65                 68
78                 78
60                 88
55                 31
26                 27
49                 29
35                 30
62                 50
41                 43
45                 44
76                 51
20                 20
 2                  2
 1                  1

1. Find by the least squares method the parameters a and b in the equation y = a + bx.
2. Plot the measurements y against x and draw the computed line.
3. Determine the standard error $s_b$ of the slope.
With the t test, are the following hypotheses accepted or rejected at the 5% level?
4. Slope = 0.
5. Slope = 1. This was the investigators' working hypothesis, i.e., that the two methods yield equal results. (See List of Answers to Problems.)
PROBLEM 9-2
Could you calculate the correlation coefficient measuring the relationship between blood pressure (systolic) in white males and white females for the age group 50 to 54 years? If not, why?

Plasma Volume and Total Circulating Albumin for 58 Normal Males

Individual   Plasma volume   Total circulating      Individual   Plasma volume   Total circulating
number       in cc (x)       albumin in gm (y)      number       in cc (x)       albumin in gm (y)
 1           2,575           119                    30           2,790           133
 2           2,896           133                    31           3,007           153
 3           2,429           121                    32           1,972            91
 4           2,552           129                    33           2,525           116
 5           3,213           146                    34           4,082           204
 6           2,921           146                    35           2,326           118
 7           3,607           182                    36           2,371           112
 8           3,142           145                    37           2,832           118
 9           2,524           116                    38           3,170           144
10           2,599           118                    39           2,244           102
11           2,900           136                    40           1,908            87
12           2,802           143                    41           2,333            98
13           2,508           139                    42           2,946           126
14           2,642           144                    43           2,723           122
15           3,219           163                    44           3,555           169
16           3,307           146                    45           3,114           153
17           2,210           100                    46           2,635           133
18           2,817           130                    47           2,646           115
19           2,711           118                    48           2,330           105
20           2,191           106                    49           2,065            87
21           2,818           141                    50           2,540           117
22           3,187           153                    51           3,145           147
23           2,726           144                    52           2,242            94
24           1,989           106                    53           2,570           116
25           2,526           131                    54           2,700           120
26           2,491           120                    55           2,815           129
27           2,366           124                    56           3,130           139
28           2,950           143                    57           3,070           133
29           2,048            93                    58           2,206            97

For these data: Σx = 156,868; Σx² = 434,974,920; Σy = 7,413; Σy² = 977,981; Σxy = 20,579,638.
PROBLEM 9-3
Could you calculate the correlation coefficient measuring the relationship between systolic and diastolic blood pressure in white males of age 50 to 54? How does this problem differ from that expressed in Problem 9-2?

PROBLEM 9-4
For 58 normal males, plasma volume and circulating albumin were determined (Ann. Surg., 121:352, 1945). These measurements are given in the table "Plasma Volume and Total Circulating Albumin for 58 Normal Males" above.
1. Plot these data in a scatter diagram.
2. Is there evidence of a relationship between circulating albumin and plasma volume? If there is, is it linear in character? Positive or negative?
3. Calculate the value of the correlation coefficient showing the relationship between plasma volume and circulating albumin.
4. Is the correlation coefficient significant?
5. Find the equation of the regression line that would be used to predict circulating albumin from plasma volume.
6. Calculate the standard error of estimate for this regression line.
10. Goodness of Fit

"Goodness of fit" is a generic term for tests that determine if the observed frequencies in a distribution agree with those that would be expected according to some hypothesis.
The test compares the observed whole numbers ($O_i$) with expected numbers ($E_i$) that because of computational circumstances are not necessarily integers. It is assumed that the observations are classified into $k$ mutually exclusive and exhaustive categories so that $i = 1, 2, \ldots, k$. The underlying formula is:

$\chi^2 = \sum_i (O_i - E_i)^2/E_i,$

which has degrees of freedom (d.f.)

d.f. $= k - 1 - l,$

where $l$ is the number of parameters estimated from the data in calculating the $E_i$.
The basic element is actually a standard measure squared. The assumption is that of the Poisson distribution. The whole numbers $O_i$ are assumed to be samples of a distribution with mean $E_i$ and Poisson standard error $\sqrt{E_i}$ (see Chapter 7). $(O_i - E_i)/\sqrt{E_i}$ is, therefore, a standard measure and $(O_i - E_i)^2/E_i$ is approximately distributed as $\chi^2_1$. The summation of these $\chi^2_1$s follows the distribution of $\chi^2$ with a number of degrees of freedom that is given by the formula above. It is used commonly by the geneticist in determining how closely an observed distribution follows an expected genetic pattern. It is used frequently to determine whether a particular distribution is sufficiently like a normal distribution so that the mean, standard deviation, and table of areas of the normal curve may be used in describing it.
As an example, consider in Table 10-1 the distribution of 200 patients with evidence of gastric ulcer that was later confirmed by x-ray examination, according to total acid content of the stomach following a stimulating dose of histamine. The mean, $\bar{x}$, is 100.4, and the standard deviation, $s_x$, is 22.41.
Table 10-1. Distribution of 200 Patients with Gastric Ulcer Diagnosed According to Total Acid Content of Stomach Following Stimulating Dose of Histamine

Units of       Observed   Expected
total acid     (O)        (E)         O - E    (O - E)^2/E
< 60             6          7.2       -1.2     0.200
60-70           12         10.2        1.8     0.318
70-80           17         18.9       -1.9     0.191
80-90           27         28.3       -1.3     0.060
90-100          36         33.8        2.2     0.143
100-110         34         34.9       -0.9     0.023
110-120         28         28.3       -0.3     0.003
120-130         23         19.7        3.3     0.553
130-140         13         11.0        2.0     0.364
≥ 140            4          7.7       -3.7     1.778
Total          200        200.0        0.0     3.633

Source: Unpublished data from H. Bancroft.
The expected values are computed by determining the proportion of the area of the normal curve that falls between the beginning of successive class intervals and then multiplying these proportions in each group by $n$, the size of the sample. For example, to determine how many fall between 70 and 80 units of acid, the value $x' = (70 - \bar{x})/s_x$ is calculated and the area at $x'$ is found from a detailed table of areas of the normal curve. Interpolation in Table A (p. 207) will give satisfactory but slightly different results. In this instance, (70 - 100.4)/22.41 = -1.36. The proportion of the area below this $x'$ is 0.0869. In the same way (80 - 100.4)/22.41 = -0.91. The area for this value of $x'$ is 0.1814. The area between 70 and 80 is then 0.1814 - 0.0869 or 0.0945. Since there are 200 in the group, there would be 200 × 0.0945 or 18.9 persons with total acid between 70 and 80 units. Expected frequencies in other class intervals are determined by the same method.
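The calculation of expected frequencies described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' procedure: the function names are hypothetical, the class boundaries 60, 70, ..., 140 are an assumption inferred from the worked example and the ten rows of Table 10-1, and the normal areas come from the error function instead of a printed table, so the results differ slightly from the tabulated values (for example, about 18.8 instead of 18.9 for the 70-80 class).

```python
import math

def normal_cdf(z):
    """Proportion of the area of the standard normal curve below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_frequencies(boundaries, mean, sd, n):
    """Expected counts for the classes below b0, b0 to b1, ..., above the last boundary."""
    # Cumulative area at each boundary, padded with 0 and 1 for the open-ended classes.
    areas = [0.0] + [normal_cdf((b - mean) / sd) for b in boundaries] + [1.0]
    # Area in each class is the difference of successive cumulative areas.
    return [n * (hi - lo) for lo, hi in zip(areas, areas[1:])]

# Assumed class boundaries (inferred, not stated explicitly in the text).
boundaries = list(range(60, 150, 10))          # 60, 70, ..., 140
expected = expected_frequencies(boundaries, mean=100.4, sd=22.41, n=200)

for e in expected:
    print(round(e, 1))
```

Because the areas of all classes add to 1, the expected frequencies add to the sample size, which gives a quick arithmetic check.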
Examination of Table 10-1 shows some difference between the observed and expected frequencies in every interval. The $\chi^2$ test is used to determine whether the observed frequency distribution is sufficiently different from the expected to reject the hypothesis that the sample comes from a normal distribution. The value of $\chi^2$ is calculated by using the formula

$\chi^2 = \sum (O - E)^2/E.$
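As a check on the arithmetic, the formula can be applied to the observed and expected columns of Table 10-1. The sketch below is only illustrative; the variable names are hypothetical, the degrees of freedom use d.f. = k - 1 - l from the beginning of the chapter with l = 2 (mean and standard deviation estimated from the data), and 14.067 is the 5% point for 7 degrees of freedom quoted in the text.

```python
# Observed and expected frequencies from Table 10-1.
observed = [6, 12, 17, 27, 36, 34, 28, 23, 13, 4]
expected = [7.2, 10.2, 18.9, 28.3, 33.8, 34.9, 28.3, 19.7, 11.0, 7.7]

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

chi2 = chi_square(observed, expected)

k = len(observed)             # number of classes
estimated_parameters = 2      # mean and standard deviation
df = k - 1 - estimated_parameters

critical_5pct = 14.067        # 5% cut-off point for 7 d.f., as quoted in the text
print(round(chi2, 3), df, chi2 > critical_5pct)
```

The result agrees with the total of the last column of Table 10-1 to within rounding.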
To determine the degrees of freedom here, the following is considered. There are 10 pairs of frequencies for comparison in the analysis ($k = 10$), and two parameters were estimated, $\bar{x}$ and $s_x$; hence, d.f. = 10 - 1 - 2 = 7. The calculated $\chi^2_7 = 3.69$ does not exceed the 5% cut-off point, 14.067, and the fit is considered to be adequate. If the associated probability had been low (e.g., less than 5%), this would have been evidence of a poor fit.
As an example of fitting genetic theory, consider Table 10-2. The genetic genotypes of gamma globulin Gm(a) and Gm(b) are codominant so that phenotypes aa, ab, and bb can be identified. The two genes appear with different frequency in the population. Let the frequency of Gm(a) be $p$ and that of Gm(b) be $q$ so that $p + q = 1$. In a population with random mating the probabilities of one person possessing phenotypes aa, ab, or bb are $p^2$, $2pq$, and $q^2$, respectively.
The frequency of gene patterns in married couples is similarly derived from the expansion of $(p^2 + 2pq + q^2)^2$. The derived probabilities are given in the second column of Table 10-2. In order to assess the gene frequency in the sample with maximum likelihood, one considers the total of $4n$ or 996 genes. Of these, Gm(a) occurs as many times as the letter a occurs in the mating pattern. For example, in the pattern ab × bb it occurs once, but four times in the pattern aa × aa. To estimate $p$, one adds the product of gene occurrence and observed pattern frequency and divides the sum by $4n$. Thus the estimate of $p$ is

$p = (102 + 58 + 24 + 16 + 30)/996 = 0.2309,$

and that of $q$ is

$q = (306 + 58 + 8 + 30 + 364)/996 = 0.7691.$

Note that $p + q = 1$, so that a separate estimate of $q$ is unnecessary when $p$ is estimated, except for purposes of checking one's arithmetic.
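The gene-counting estimate can be reproduced with a short Python sketch. The dictionary layout and function name are hypothetical; the observed counts per mating pattern are those of Table 10-2, and each pattern is written as a pair of phenotypes so that the letters a and b can simply be counted.

```python
# Observed numbers of couples by mating pattern (Table 10-2).
observed = {
    ("ab", "bb"): 102,
    ("ab", "ab"): 29,
    ("ab", "aa"): 8,
    ("aa", "aa"): 4,
    ("aa", "bb"): 15,
    ("bb", "bb"): 91,
}

def gene_count(mating, gene):
    """Number of times a gene letter occurs in a mating pattern (0 to 4)."""
    return sum(phenotype.count(gene) for phenotype in mating)

n = sum(observed.values())       # number of couples
total_genes = 4 * n              # two genes per person, two persons per couple

a_genes = sum(gene_count(m, "a") * f for m, f in observed.items())
b_genes = sum(gene_count(m, "b") * f for m, f in observed.items())

p = a_genes / total_genes
q = b_genes / total_genes
print(a_genes, b_genes, round(p, 4), round(q, 4))
```

Counting the b genes separately serves only as the arithmetic check mentioned above, since the two estimates must add to 1.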
Table 10-2. Distribution of Phenotypes of Gamma Globulin Factors Gm(a) and Gm(b) in a Random Sample of 249 Married Couples

Mating          Probability of      Observed           Expected frequency
(phenotypes)    occurrence ($f_i$)  frequency ($O_i$)  ($E_i = n \times f_i$)   $(O_i - E_i)^2/E_i$
ab × bb         $4pq^3$             102                104.6                    0.065
ab × ab         $4p^2q^2$            29                 31.4                    0.183
ab × aa         $4p^3q$               8                  9.5
aa × aa         $p^4$                 4                  0.7
  (ab × aa and aa × aa combined)     12                 10.2                    0.317
aa × bb         $2p^2q^2$            15                 15.7                    0.031
bb × bb         $q^4$                91                 87.1                    0.175
Total                               249                249.0                    $\chi^2 = 0.771$

d.f. = 5 - 1 - 1 = 3; P > 5%
Estimate of gene frequency of Gm(a):
$p = [(102) + 2(29) + 3(8) + 4(4) + 2(15)]/(4(249)) = 230/996 = 0.2309$
$q = 1 - p = 0.7691$
Source: A. G. Steinberg in D. W. Clark and B. MacMahon, Preventive Medicine, Little, Brown & Company, Boston, 1967, p. 107.

The expected frequencies are then computed as $E_i = n f_i$; in the first row, $f_i$ is $4pq^3$ and $E_1 = (249)4(0.2309)(0.7691)^3 = 104.6$. The sum of the $E_i$ should equal the sample size, $n$. Before proceeding to compute $\chi^2$, the expectancies $E_i$ (in the fourth column of Table 10-2) should be inspected for low values. The approximation to the $\chi^2$-distribution is not good if any expectancies are less than 5.0, and it is unacceptable with expectancies less than 1.0.
In this example, the pattern aa × aa is expected to occur 0.7 times out of 249. The usual circumvention of this dilemma is to combine groups of small expectancies until the sum of expectancies is reasonably high. In this example, patterns aa × aa and ab × aa are added to yield a mutual expectancy of 9.5 + 0.7 = 10.2. The effect is a loss in degrees of freedom. Originally, there were six subsets in the distribution. With the collapsing of two groups into one, there are only five subsets. The final degrees of freedom are then 5 - 1 - 1 = 3, because only one parameter had to be estimated from the data. The $\chi^2$ is found to be 0.771, which is much smaller than the 5% cut-off point for 3 degrees of freedom (7.82). It can therefore be concluded that the sample distribution of couples fits the genetic hypothesis of a random distribution of a population with the stated gene frequencies.
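The whole genetic example, from expected frequencies through pooling to the test statistic, can be sketched as follows. The code is an illustration under stated assumptions: the names are hypothetical, p is taken as 230/996 without rounding, and the two patterns to be pooled are named explicitly as in the text instead of being selected automatically. Because the expectancies are not rounded to one decimal, the statistic comes out close to, but not exactly equal to, the tabulated 0.771.

```python
n = 249
p = 230 / 996        # estimated frequency of Gm(a)
q = 1 - p            # frequency of Gm(b)

# (mating pattern, probability from the expansion of (p^2 + 2pq + q^2)^2, observed count)
patterns = [
    ("ab x bb", 4 * p * q**3,      102),
    ("ab x ab", 4 * p**2 * q**2,    29),
    ("ab x aa", 4 * p**3 * q,        8),
    ("aa x aa", p**4,                4),
    ("aa x bb", 2 * p**2 * q**2,    15),
    ("bb x bb", q**4,               91),
]

expected = {name: n * f for name, f, _ in patterns}
observed = {name: o for name, _, o in patterns}

# Pool the two patterns with small expectancies, as done in the text.
pooled = ("ab x aa", "aa x aa")
obs_cells = [observed[name] for name in observed if name not in pooled]
exp_cells = [expected[name] for name in expected if name not in pooled]
obs_cells.append(sum(observed[name] for name in pooled))
exp_cells.append(sum(expected[name] for name in pooled))

chi2 = sum((o - e) ** 2 / e for o, e in zip(obs_cells, exp_cells))
df = len(obs_cells) - 1 - 1      # five subsets, one estimated parameter (p)

critical_5pct = 7.82             # 5% cut-off point for 3 d.f., as quoted in the text
print(round(chi2, 2), df, chi2 > critical_5pct)
```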
m × n CONTINGENCY TABLES
An m × n contingency table presents the frequencies of joint occurrence of two attributes that each have more than two subsets so that there are $m$ columns and $n$ rows of frequencies. Totals for each row and for each column are given. The usual null hypothesis tested is that each row has the same distribution as the row forming the total of columns or, equivalently, that the distribution in each column is the same as that in the column for row total.
We have already considered significance tests for differences of two or several rates (see Chapter 8). These problems can be viewed as 2 × 2 or 2 × n contingency tables having two columns for a dichotomous classification and two or more rows for groups. The following test for an m × n contingency table is equivalent to the (uncorrected) tests previously presented for the special case when $m = 2$.
Let the frequency in the $i$th row and $j$th column be $O_{ij}$, and let the sum of frequencies in the $i$th row be $r_i$, and the sum of frequencies in the $j$th column be $c_j$. Finally, let the sum of all frequencies be

$N = \sum_{i=1}^{n} r_i = \sum_{j=1}^{m} c_j.$

The expected frequency for the observed $O_{ij}$ is then

$E_{ij} = c_j \times r_i/N = r_i \times c_j/N.$
We test the agreement between the observed and expected frequency by

$\chi^2 = \sum (O_{ij} - E_{ij})^2/E_{ij}.$

In order to compute expectancies, two sets of parameters are needed: the two sets of marginal probabilities, $r_i/N$ and $c_j/N$. In each set one probability is given when the others are computed so that the total number of independently estimated parameters is $(m - 1) + (n - 1)$. The degrees of freedom are then calculated by the usual formula, given $m \times n$ groups:

d.f. $= m \times n - 1 - [(m - 1) + (n - 1)].$

The right side term can be simplified to read, for the special case of an m × n contingency table, as follows:

d.f. $= (m - 1)(n - 1).$

For example, consider the data of Table 10-3. Stage of gestation is classified into three subclasses, as is blood loss, making a total of nine compartments.
Table 10-3. Hemorrhage in Premature Separation of the Placenta, by Amount of Blood Lost and by Stage of Gestation

Observed frequencies ($O_{ij}$)
Total blood loss (ml)
1,000                       6       23       22       51      0.228
Total for columns ($c_j$)
$c_j/N$
1,000                       7.5     22.5     21.0     51.0
Total                      33.0     99.0     92.0    224.0
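The computation for an m × n table can be sketched as below. The function name is hypothetical and the code is only an illustration of the formulas $E_{ij} = r_i \times c_j/N$ and d.f. = (m - 1)(n - 1). The first part uses only the margins that appear in Table 10-3 (row total 51, column totals 33, 99, and 92, N = 224) and gives approximately the expectancies 7.5, 22.5, and 21.0 shown there. The second part runs the full test on a small table of made-up counts, labeled as hypothetical, since the other observed cells are not reproduced here.

```python
def contingency_chi_square(table):
    """Chi-square, degrees of freedom, and expected frequencies for a table given as rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected frequency of each cell: row total times column total over N.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# Margins taken from Table 10-3.
col_totals = [33, 99, 92]
N = 224
row_total = 51
expected_row = [row_total * c / N for c in col_totals]
print([round(e, 1) for e in expected_row])

# Hypothetical counts, for illustration only (not data from the text).
hypothetical = [[10, 20],
                [30, 40]]
chi2, df, expected = contingency_chi_square(hypothetical)
print(round(chi2, 3), df)
```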
[Table: population, deaths from cancer, and cancer mortality rates, 1920 and 1950.]
Sensitivity = TP/D(+).
Specificity = TN/D(-).
Prevalence of disease = D(+)/n.
Positivity of procedure = P(+)/n.
Sensitivity is the proportion of true positive among diseased.
Specificity is the proportion of true negative among nondiseased.
In order that a procedure can at all be of diagnostic value for a given disease one must demand that sensitivity + specificity > 1.0. If this sum is approaching 2.0, the test is ideal. If a screening test is one with sensitivity approaching 1.0, it means that disease can practically be excluded among persons with a negative test. Effort for definite diagnosis can then be concentrated on persons with positive tests. The yield of true positives will depend on the specificity of the procedure and prevalence of disease. The tuberculin skin test, for example, has high sensitivity, but the positive test includes those with past infection, present symptom-free infection, partially developing disease, and active clinical tuberculosis.
In comparing two diagnostic procedures in the same disease, sensitivity and specificity are important measurements. The statistical test that evaluates the significance of observed differences in sensitivity and specificity between two procedures performed in the same patients is a special case of the $\chi^2$ test. For example, in an article in Surgery (64:332-338, 1968), Sigel et al. compare clinical diagnoses with the Doppler ultrasound method in the diagnosis of lower-extremity venous occlusion. Venography was used to confirm or reject the diagnosis. Naturally the clinical diagnosis was assessed without knowledge of the outcome of the Doppler test or of venography. Table 13-3 presents the findings in 44 extremities with venous occlusion and in 77 extremities for which venographic examination showed normal conditions.

Table 13-3. Lower Extremities With or Without Venous Occlusion: Distribution by Results of Clinical Diagnosis and of Doppler Test

Clinical         Doppler      Venous              Normal
diagnosis (1)    test (2)     occlusion           veins
+                +            22 ($S_1S_2$)        4 ($F_1F_2$)
+                0             3 ($S_1F_2$)       27 ($F_1S_2$)
0                +            16 ($F_1S_2$)        5 ($S_1F_2$)
0                0             3 ($F_1F_2$)       41 ($S_1S_2$)
Total                         44                  77

             Sensitivity              Specificity
Clinical:    (3 + 22)/44 = 0.57       (5 + 41)/77 = 0.60
Doppler:     (16 + 22)/44 = 0.86      (27 + 41)/77 = 0.88

Data from B. Sigel et al., Surgery 64:332-338, 1968.

The usefulness of each procedure is shown by the fact that the sum of sensitivity and specificity exceeds 1.0. A statistical test for validity of the procedure is performed in each as the usual $\chi^2$ test for 2 × 2 tables (see Chapter 8). For the clinical diagnosis

$\chi^2 = \frac{(25 \times 77 - 44 \times 31 - 60.5)^2 \times 121}{44 \times 77 \times 56 \times 65} = 2.46 \quad (P > 5\%).$
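The figures of Table 13-3 and the test for the clinical diagnosis can be verified with the following sketch. The names are hypothetical, and the test is written in the familiar form with $ad - bc$ and a continuity correction of $N/2$; this is assumed to be the form behind the expression above, since $25 \times 77 - 44 \times 31 = 25 \times 46 - 19 \times 31 = 561$ and $N/2 = 60.5$.

```python
# Counts from Table 13-3, keyed by (clinical diagnosis, Doppler test).
occlusion = {("+", "+"): 22, ("+", "0"): 3, ("0", "+"): 16, ("0", "0"): 3}
normal = {("+", "+"): 4, ("+", "0"): 27, ("0", "+"): 5, ("0", "0"): 41}

def positives(counts, procedure):
    """Extremities called positive by a procedure (0 = clinical, 1 = Doppler)."""
    return sum(v for key, v in counts.items() if key[procedure] == "+")

n_occlusion = sum(occlusion.values())
n_normal = sum(normal.values())

sens_clinical = positives(occlusion, 0) / n_occlusion
sens_doppler = positives(occlusion, 1) / n_occlusion
spec_clinical = (n_normal - positives(normal, 0)) / n_normal
spec_doppler = (n_normal - positives(normal, 1)) / n_normal

def corrected_chi_square(a, b, c, d):
    """Chi-square with continuity correction for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = (abs(a * d - b * c) - n / 2) ** 2 * n
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Clinical diagnosis: positive and negative calls among occluded and normal extremities.
a = positives(occlusion, 0)      # true positives
b = n_occlusion - a              # false negatives
c = positives(normal, 0)         # false positives
d = n_normal - c                 # true negatives
chi2_clinical = corrected_chi_square(a, b, c, d)

print(round(sens_clinical, 2), round(spec_clinical, 2))
print(round(sens_doppler, 2), round(spec_doppler, 2))
print(round(chi2_clinical, 2))
```

The same function applied to the Doppler counts would give the corresponding test for that procedure.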