
Solutions Manual for Sampling: Design and Analysis, Second Edition

Sharon L. Lohr

Copyright © 2009 by Sharon L. Lohr

Chapter 1

Introduction

1.1 Target population: Unclear, but presumed to be readers of Parade magazine.
Sampling frame: Persons who know about the telephone survey.
Sampling unit = observation unit: One call. (Although it would also be correct to consider the sampling unit to be a person, the survey is so badly done that it is difficult to tell what the units are.)
As noted in Section 1.3, samples that consist only of volunteers are suspect. This is especially true of surveys in which respondents must pay to participate, as here: persons willing to pay 75 cents a call are likely to have strong opinions about the legalization of marijuana, and it is impossible to say whether pro- or anti-legalization adherents are more likely to call. This survey is utterly worthless for measuring public opinion because of its call-in format. Other potential biases, such as requiring a touch-tone telephone, the sensitive subject matter, or the ambiguity of the wording (what does "as legal as alcoholic beverages" mean?), probably make little difference because the call-in structure by itself destroys all credibility for the survey.

1.2 Target population: All mutual funds.
Sampling frame: Mutual funds listed in the newspaper.
Sampling unit = observation unit: One listing.
As funds are listed alphabetically by company, there is no reason to believe there will be any selection bias from the sampling frame. There may be undercoverage, however, if smaller or newer funds are not listed in the newspaper.

1.3 Target population: Not specified, but a target population of interest would be persons who have read the book.
Sampling frame: Persons who visit the website.
Sampling unit = observation unit: One review.


The reviews are contributed by volunteers, so they cannot be taken as representative of readers' opinions. Indeed, there have been instances where authors of competing books have written negative reviews of a book, although amazon.com tries to curb such practices.

1.4 Target population: Persons eligible for jury duty in Maricopa County.
Sampling frame: County residents who are registered voters or licensed drivers over 18.
Sampling unit = observation unit: One resident.
Selection bias occurs largely because of undercoverage and nonresponse. Eligible jurors may not appear in the sampling frame because they are not registered to vote and do not possess an Arizona driver's license. Addresses on either list may not be up to date. In addition, jurors fail to appear or are excused; this is nonresponse. A similar question for class discussion is whether there was selection bias in selecting which young men in the U.S. were to be drafted and sent to Vietnam.

1.5 Target population: All homeless persons in the study area.
Sampling frame: Clinics participating in the Health Care for the Homeless project.
Sampling unit: Unclear. Depending on assumptions made about the survey design, one could say either a clinic or a homeless person is the sampling unit.
Observation unit: Person.
Selection bias may be a serious problem for this survey. Even though the demographics for HCH patients are claimed to match those of the homeless population (but do we know they match?) and the clinics are readily accessible, the patients differ in two critical ways from non-patients: (1) they needed medical treatment, and (2) they went to a clinic to get medical treatment. One does not know the likely direction of selection bias, but there is no reason to believe that the same percentages of patients and non-patients are mentally ill.

1.6 Target population: Female readers of Prevention magazine.
Sampling frame: Women who see the survey in a copy of the magazine.
Sampling unit = observation unit: One woman.
This is a mail-in survey of volunteers, and we cannot trust any statistics from it.

1.7 Target population: All cows in the region.
Sampling frame: List of all farms in the region.
Sampling unit: One farm.
Observation unit: One cow.
There is no reason to anticipate selection bias in this survey. The design is a single-stage cluster sample, discussed in Chapter 5.

1.8 Target population: Licensed boarding homes for the elderly in Washington state.
Sampling frame: List of 184 licensed homes.
Sampling unit = observation unit: One home.
Nonresponse is the obvious problem here, with only 43 of 184 administrators or food service managers responding. It may be that the respondents are the larger homes, or that their menus have better nutrition. The problem with nonresponse, though, is that we can only conjecture the direction of the nonresponse bias.

1.13 Target population: All attendees of the 2005 JSM.
Sampling population: E-mail addresses provided by the attendees of the 2005 JSM.
Sampling unit: One e-mail address.
It is stated that the small sample of conference registrants was selected randomly. This is good, since the ASA can control the quality better and follow up on nonrespondents. It also means, since the sample is randomly selected, that persons with strong opinions cannot flood the survey. But nonresponse is a potential problem: response is not mandatory, and it might be feared that only attendees with strong opinions or a strong sense of loyalty to the ASA will respond to the survey.

1.14 Target population: All professors of education.
Sampling population: List of education professors.
Sampling unit: One professor.
Information about how the sample was selected was not given in the publication, but let's assume it was a random sample. Obviously, nonresponse is a huge problem with this survey. Of the 5324 professors selected to be in the sample, only 900 were interviewed. Professors who travel during summer could of course not be contacted; also, summer is the worst time of year to try to interview professors for a survey.
1.15 Target population: All adults.
Sampling population: Friends and relatives of American Cancer Society volunteers.
Sampling unit: One person.
Here's what I wrote about the survey elsewhere: "Although the sample contained Americans of diverse ages and backgrounds, and the sample may have provided valuable information for exploring factors associated with development of cancer, its validity for investigating the relationship between amount of sleep and mortality is questionable. The questions about amount of sleep and insomnia were not the focus of the original study, and the survey was not designed to obtain accurate responses to those questions. The design did not allow researchers to assess whether the sample was representative of the target population of all Americans. Because of the shortcomings in the survey design, it is impossible to know whether the conclusions in Kripke et al. (2002) about sleep and mortality are valid or not." (pp. 97–98)

Lohr, S. (2008). "Coverage and sampling," Chapter 6 of International Handbook of Survey Methodology, ed. E. de Leeuw, J. Hox, and D. Dillman. New York: Erlbaum, 97–112.

1.25 Students will have many different opinions on this issue. Of historical interest is this excerpt of a letter written by James Madison to Thomas Jefferson on February 14, 1790:

    A Bill for taking a census has passed the House of Representatives, and is with the Senate. It contained a schedule for ascertaining the component classes of the Society, a kind of information extremely requisite to the Legislator, and much wanted for the science of Political Economy. A repetition of it every ten years would hereafter afford a most curious and instructive assemblage of facts. It was thrown out by the Senate as a waste of trouble and supplying materials for idle people to make a book. Judge by this little experiment of the reception likely to be given to so great an idea as that explained in your letter of September.

Chapter 2

Simple Probability Samples

2.1 (a)

    ȳ_U = (98 + 102 + 154 + 133 + 190 + 175)/6 = 142.

(b) For each plan, we first find the sampling distribution of ȳ.

Plan 1:

    Sample number   P(S)   ȳ_S
    1               1/8    147.33
    2               1/8    142.33
    3               1/8    140.33
    4               1/8    135.33
    5               1/8    148.67
    6               1/8    143.67
    7               1/8    141.67
    8               1/8    136.67

(i) E[ȳ] = (1/8)(147.33) + (1/8)(142.33) + · · · + (1/8)(136.67) = 142.
(ii) V[ȳ] = (1/8)(147.33 − 142)² + (1/8)(142.33 − 142)² + · · · + (1/8)(136.67 − 142)² = 18.94.
(iii) Bias[ȳ] = E[ȳ] − ȳ_U = 142 − 142 = 0.
(iv) Since Bias[ȳ] = 0, MSE[ȳ] = V[ȳ] = 18.94.

Plan 2:

    Sample number   P(S)   ȳ_S
    1               1/4    135.33
    2               1/2    143.67
    3               1/4    147.33

(i) E[ȳ] = (1/4)(135.33) + (1/2)(143.67) + (1/4)(147.33) = 142.5.
(ii) V[ȳ] = (1/4)(135.33 − 142.5)² + (1/2)(143.67 − 142.5)² + (1/4)(147.33 − 142.5)² = 12.84 + 0.68 + 5.84 = 19.36.
(iii) Bias[ȳ] = E[ȳ] − ȳ_U = 142.5 − 142 = 0.5.
(iv) MSE[ȳ] = V[ȳ] + (Bias[ȳ])² = 19.61.

(c) Clearly, Plan 1 is better. It has smaller variance and is unbiased as well.

2.2 (a) Unit 1 appears in samples 1 and 3, so π₁ = P(S₁) + P(S₃) = 1/8 + 1/8 = 1/4. Similarly,

    π₂ = 1/4 + 3/8 = 5/8
    π₃ = 1/8 + 1/4 = 3/8
    π₄ = 1/8 + 3/8 + 1/8 = 5/8
    π₅ = 1/8 + 1/8 = 1/4
    π₆ = 1/8 + 1/8 + 3/8 = 5/8
    π₇ = 1/4 + 1/8 = 3/8
    π₈ = 1/4 + 1/8 + 3/8 + 1/8 = 7/8.

Note that Σ_{i=1}^8 π_i = 4 = n.

(b)

    Sample, S      P(S)   t̂
    {1, 3, 5, 6}   1/8    38
    {2, 3, 7, 8}   1/4    42
    {1, 4, 6, 8}   1/8    40
    {2, 4, 6, 8}   3/8    42
    {4, 5, 7, 8}   1/8    52

Thus the sampling distribution of t̂ is:

    k          38    40    42    52
    P(t̂ = k)  1/8   1/8   5/8   1/8
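The sampling-distribution calculations in 2.1 are easy to double-check numerically. The sketch below (Python used purely as a calculator; it is not part of the original solutions) reproduces the mean, variance, bias, and MSE for both plans:

```python
# Check the sampling-distribution calculations in Exercise 2.1.
ybar_U = (98 + 102 + 154 + 133 + 190 + 175) / 6  # population mean, 142

def moments(probs, ybars):
    """Return E[ybar], V[ybar], Bias[ybar], MSE[ybar]."""
    E = sum(p * y for p, y in zip(probs, ybars))
    V = sum(p * (y - E) ** 2 for p, y in zip(probs, ybars))
    bias = E - ybar_U
    return E, V, bias, V + bias ** 2

# Plan 1: eight equally likely samples
plan1 = moments([1/8] * 8,
                [147.33, 142.33, 140.33, 135.33, 148.67, 143.67, 141.67, 136.67])
# Plan 2: three samples with unequal probabilities
plan2 = moments([1/4, 1/2, 1/4], [135.33, 143.67, 147.33])

print(plan1)  # E = 142, V ≈ 18.94, Bias = 0,   MSE ≈ 18.94
print(plan2)  # E = 142.5, V ≈ 19.36, Bias = 0.5, MSE ≈ 19.61
```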

2.3 No, because thick books have a higher inclusion probability than thin books.

2.4 (a) A total of (8 choose 3) = 56 samples are possible, each with probability of selection 1/56. The R function samplist below will (inefficiently!) generate each of the 56 samples. To find the sampling distribution of ȳ, I used the commands samplist
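The R code itself is not reproduced in this excerpt. An equivalent enumeration in Python (my sketch, not the manual's samplist function) shows the same count of equally likely samples:

```python
from itertools import combinations

# Enumerate all (8 choose 3) = 56 equally likely SRS samples of size 3
# from a population of 8 units, as in Exercise 2.4(a).
population = list(range(1, 9))          # unit labels 1..8
samples = list(combinations(population, 3))
print(len(samples))                     # 56
print(samples[0])                       # (1, 2, 3)
```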

(d) If π_{ik} ≥ (n − 1)π_iπ_k/n for all i and k, then

    min_k (π_{ik}/π_k) ≥ ((n − 1)/n) π_i   for all i.

Consequently,

    Σ_{i=1}^N min_k (π_{ik}/π_k) ≥ ((n − 1)/n) Σ_{i=1}^N π_i = n − 1.

(e) Suppose V(t̂_HT) ≤ V(t̂_ψ). Then from part (b),

0 ≤ V(t̂_ψ) − V(t̂_HT)
  = (1/2) Σ_{i=1}^N Σ_{k≠i} [π_{ik} − ((n−1)/n)π_iπ_k] (t_i/π_i − t_k/π_k)²
  = (1/2) Σ_{i=1}^N Σ_{k≠i} [π_{ik} − ((n−1)/n)π_iπ_k] [(t_i/π_i)² + (t_k/π_k)² − 2 t_it_k/(π_iπ_k)]
  = Σ_{i=1}^N Σ_{k≠i} [π_{ik} − ((n−1)/n)π_iπ_k] [(t_i/π_i)² − t_it_k/(π_iπ_k)]
  = Σ_{i=1}^N Σ_{k≠i} π_{ik}(t_i/π_i)² − ((n−1)/n) Σ_{i=1}^N π_i(n − π_i)(t_i/π_i)²
      − Σ_{i=1}^N Σ_{k≠i} π_{ik} t_it_k/(π_iπ_k) + ((n−1)/n) Σ_{i=1}^N Σ_{k≠i} t_it_k
  = (n−1) Σ_{i=1}^N t_i²/π_i − (n−1) Σ_{i=1}^N t_i²/π_i + ((n−1)/n) Σ_{i=1}^N t_i²
      − Σ_{i=1}^N Σ_{k≠i} π_{ik} t_it_k/(π_iπ_k) + ((n−1)/n) Σ_{i=1}^N Σ_{k≠i} t_it_k,

using Σ_{k≠i} π_{ik} = (n−1)π_i and Σ_{k≠i} π_k = n − π_i. Consequently,

    ((n−1)/n) Σ_{i=1}^N Σ_{k=1}^N t_it_k ≥ Σ_{i=1}^N Σ_{k≠i} π_{ik} t_it_k/(π_iπ_k),

or

    Σ_{i=1}^N Σ_{k=1}^N a_{ik} t_it_k ≥ 0,

where a_{ii} = 1 and

    a_{ik} = 1 − (n/(n−1)) π_{ik}/(π_iπ_k)   if i ≠ k.

Since the t_i are arbitrary, the matrix A must be nonnegative definite, which means that all principal submatrices must have determinant ≥ 0. Using a 2 × 2 principal submatrix, we have 1 − a_{ik}² ≥ 0, which gives the result.

6.24 (a) We have the following table of joint inclusion probabilities π_{ik}:

                       k
    i        1      2      3      4     π_i
    1      0.00   0.31   0.20   0.14   0.65
    2      0.31   0.00   0.03   0.01   0.35
    3      0.20   0.03   0.00   0.31   0.54
    4      0.14   0.01   0.31   0.00   0.46
    π_k    0.65   0.35   0.54   0.46   2.00

(b)

    Sample, S   t̂_HT    V̂_HT(t̂_HT)   V̂_SYG(t̂_HT)
    {1, 2}       9.56      38.10       −0.9287681
    {1, 3}       5.88      −4.74        2.4710422
    {1, 4}       4.93      −3.68        8.6463858
    {2, 3}       7.75    −100.25       71.6674365
    {2, 4}      21.41    −165.72      323.3238494
    {3, 4}       3.12       3.42       −0.1793659

Note that Σ_i Σ_{k>i} π_{ik} V̂(t̂_HT) = 6.74 for each method.
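The joint-inclusion table in 6.24(a) can be checked for internal consistency: with n = 2, each off-diagonal row sum must equal π_i (Theorem 6.1), the π_i must sum to n, and the probabilities of the six possible samples must sum to 1. A quick check (Python as a calculator; not part of the original solutions):

```python
# Consistency checks for the joint-inclusion table in Exercise 6.24(a).
pi_ik = [
    [0.00, 0.31, 0.20, 0.14],
    [0.31, 0.00, 0.03, 0.01],
    [0.20, 0.03, 0.00, 0.31],
    [0.14, 0.01, 0.31, 0.00],
]
pi = [0.65, 0.35, 0.54, 0.46]

# Theorem 6.1 with n = 2: sum over k != i of pi_ik equals (n-1) pi_i = pi_i.
row_sums = [round(sum(row), 2) for row in pi_ik]
print(row_sums)             # [0.65, 0.35, 0.54, 0.46]
print(round(sum(pi), 2))    # 2.0  (= n)
# Each sample {i,k} occurs with probability pi_ik; the six must sum to 1.
print(round(sum(pi_ik[i][k] for i in range(4) for k in range(i + 1, 4)), 2))  # 1.0
```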

CHAPTER 6. SAMPLING WITH UNEQUAL PROBABILITIES

6.25 (a)

P{psu's i and j are in the sample}
  = P{psu i drawn first and psu j drawn second} + P{psu j drawn first and psu i drawn second}
  = (a_i / Σ_{k=1}^N a_k) ψ_j/(1 − ψ_i) + (a_j / Σ_{k=1}^N a_k) ψ_i/(1 − ψ_j)
  = ψ_i(1 − ψ_i)ψ_j / [Σ_{k=1}^N a_k (1 − ψ_i)(1 − π_i)] + ψ_j(1 − ψ_j)ψ_i / [Σ_{k=1}^N a_k (1 − ψ_j)(1 − π_j)]
  = (ψ_iψ_j / Σ_{k=1}^N a_k) [1/(1 − π_i) + 1/(1 − π_j)].

(b) Using (a),

P{psu i in sample} = Σ_{j≠i} π_{ij}
  = Σ_{j≠i} (ψ_iψ_j / Σ_{k=1}^N a_k) [1/(1 − π_i) + 1/(1 − π_j)]
  = (ψ_i / Σ_{k=1}^N a_k) { Σ_{j=1}^N ψ_j [1/(1 − π_i) + 1/(1 − π_j)] − 2ψ_i/(1 − π_i) }
  = (ψ_i / Σ_{k=1}^N a_k) { 1/(1 − π_i) + Σ_{j=1}^N ψ_j/(1 − π_j) − π_i/(1 − π_i) }
  = (ψ_i / Σ_{k=1}^N a_k) { 1 + Σ_{j=1}^N ψ_j/(1 − π_j) }.

In the third step above, we used the constraint that Σ_{j=1}^N π_j = n = 2, so Σ_{j=1}^N ψ_j = 1.

Now note that

2 Σ_{k=1}^N a_k = 2 Σ_{k=1}^N ψ_k(1 − ψ_k)/(1 − 2ψ_k)
  = Σ_{k=1}^N ψ_k(1 − 2ψ_k + 1)/(1 − 2ψ_k)
  = 1 + Σ_{k=1}^N ψ_k/(1 − π_k).

Thus P{psu i in sample} = 2ψ_i = π_i.

(c) Using part (a),

π_iπ_j − π_{ij} = 4ψ_iψ_j − (ψ_iψ_j / Σ_{k=1}^N a_k) [1/(1 − π_i) + 1/(1 − π_j)]
  = ψ_iψ_j [ 4 (Σ_{k=1}^N a_k)(1 − π_i)(1 − π_j) − ((1 − π_j) + (1 − π_i)) ] / [ (Σ_{k=1}^N a_k)(1 − π_i)(1 − π_j) ].

Using the identity from part (b),

4 (Σ_{k=1}^N a_k)(1 − π_i)(1 − π_j) − ((1 − π_j) + (1 − π_i))
  = 2 [1 + Σ_{k=1}^N ψ_k/(1 − π_k)] (1 − π_i)(1 − π_j) − 2 + π_i + π_j
  = 2(1 − π_i)(1 − π_j) Σ_{k=1}^N ψ_k/(1 − π_k) + 2 − 2π_i − 2π_j + 2π_iπ_j − 2 + π_i + π_j
  = 2(1 − π_i)(1 − π_j) Σ_{k=1}^N ψ_k/(1 − π_k) − π_i − π_j + 2π_iπ_j
  ≥ 2(1 − π_i)ψ_j + 2(1 − π_j)ψ_i − π_i − π_j + 2π_iπ_j
  = (1 − π_i)π_j + (1 − π_j)π_i − π_i − π_j + 2π_iπ_j
  = 0.

(The inequality keeps only the k = i and k = j terms of the sum.) Thus π_iπ_j − π_{ij} ≥ 0, and the SYG estimator of the variance is guaranteed to be nonnegative.

6.26 The desired probabilities of inclusion are π_i = 2M_i / Σ_{j=1}^5 M_j. We calculate ψ_i = π_i/2 and a_i = ψ_i(1 − ψ_i)/(1 − π_i) for each psu in the following table:

    psu, i   M_i    π_i    ψ_i    a_i
    1         5    0.40   0.20   0.26667
    2         4    0.32   0.16   0.19765
    3         8    0.64   0.32   0.60444
    4         5    0.40   0.20   0.26667
    5         3    0.24   0.12   0.13895
    Total    25    2.00   1.00   1.47437

According to Brewer's method,

    P(select psu i on 1st draw) = a_i / Σ_{j=1}^5 a_j

and

    P(psu j on 2nd draw | psu i on 1st draw) = ψ_j/(1 − ψ_i).

Then

    P{S = (1, 2)} = (0.26667/1.47437)(0.16/0.8) = 0.036174,
    P{S = (2, 1)} = (0.19765/1.47437)(0.2/0.84) = 0.031918,

and π₁₂ = P{S = (1, 2)} + P{S = (2, 1)} = 0.068. Continuing in like manner, we have the following table of π_{ij}:

    i\j      1      2      3      4      5     Sum
    1        —    .068   .193   .090   .049   .400
    2      .068     —    .148   .068   .036   .320
    3      .193   .148     —    .193   .107   .640
    4      .090   .068   .193     —    .049   .400
    5      .049   .036   .107   .049     —    .240

We use (6.21) to calculate the variance of the Horvitz-Thompson estimator.

    i   j   π_{ij}   π_i    π_j    t_i   t_j   (π_iπ_j − π_{ij})(t_i/π_i − t_j/π_j)²
    1   2   0.068    0.40   0.32   20    25     47.39
    1   3   0.193    0.40   0.64   20    38      5.54
    1   4   0.090    0.40   0.40   20    24      6.96
    1   5   0.049    0.40   0.24   20    21     66.73
    2   3   0.148    0.32   0.64   25    38     20.13
    2   4   0.068    0.32   0.40   25    24     19.68
    2   5   0.036    0.32   0.24   25    21      3.56
    3   4   0.193    0.64   0.40   38    24      0.02
    3   5   0.107    0.64   0.24   38    21     37.16
    4   5   0.049    0.40   0.24   24    21     35.88
    Sum     1                                  243.07

Note that for this population, t = 128. To check the results, we see that

    Σ P(S) t̂_{HT,S} = 128   and   Σ P(S)(t̂_{HT,S} − 128)² = 243.07.
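The Brewer calculations in 6.26 can be reproduced directly from the draw-by-draw probabilities. A numeric check (Python as a calculator; not part of the original solutions) recovers π₁₂ and confirms that the rows of the π_{ij} table sum to the π_i, as proved in 6.25(b):

```python
# Numerical check of Brewer's method for Exercise 6.26.
M = [5, 4, 8, 5, 3]                  # psu sizes
tot = sum(M)                         # 25
psi = [m / tot for m in M]           # psi_i = pi_i / 2
pi = [2 * p for p in psi]
a = [p * (1 - p) / (1 - 2 * p) for p in psi]   # a_i = psi_i(1-psi_i)/(1-pi_i)
A = sum(a)

def pij(i, j):
    """pi_ij = P{S=(i,j)} + P{S=(j,i)} under Brewer's two-draw scheme."""
    return (a[i] / A) * psi[j] / (1 - psi[i]) + (a[j] / A) * psi[i] / (1 - psi[j])

print(round(pij(0, 1), 3))           # 0.068
# Row sums of the pi_ij table recover pi_i, as shown in Exercise 6.25(b).
rows = [sum(pij(i, j) for j in range(5) if j != i) for i in range(5)]
print([round(r, 2) for r in rows])   # [0.4, 0.32, 0.64, 0.4, 0.24]
```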

6.27 A sequence of simple random samples with replacement (SRSWR) is drawn until the first SRSWR in which the two psu's are distinct. As each SRSWR in the sequence is selected independently, for the l-th SRSWR in the sequence and for i ≠ j,

P{psu's i and j chosen in l-th SRSWR}
  = P{psu i chosen first and psu j chosen second} + P{psu j chosen first and psu i chosen second}
  = 2ψ_iψ_j.

The (l+1)-st SRSWR is chosen to be the sample if each of the previous l SRSWR's is rejected because the two psu's are the same. Now

    P(the two psu's are the same in an SRSWR) = Σ_{k=1}^N ψ_k²,

so, because SRSWR's are drawn independently,

    P(reject first l SRSWR's) = (Σ_{k=1}^N ψ_k²)^l.

Thus

π_{ij} = P{psu's i and j are in the sample}
  = Σ_{l=0}^∞ P{psu's i and j chosen in (l+1)-st SRSWR, and the first l SRSWR's are rejected}
  = Σ_{l=0}^∞ (Σ_{k=1}^N ψ_k²)^l (2ψ_iψ_j)
  = 2ψ_iψ_j / (1 − Σ_{k=1}^N ψ_k²).

Equation (6.18) implies

π_i = Σ_{j≠i} π_{ij}
  = Σ_{j≠i} 2ψ_iψ_j / (1 − Σ_{k=1}^N ψ_k²)
  = (2ψ_i − 2ψ_i²) / (1 − Σ_{k=1}^N ψ_k²)
  = 2ψ_i(1 − ψ_i) / (1 − Σ_{k=1}^N ψ_k²).

Note that, as (6.17) predicts,

    Σ_{i=1}^N π_i = 2 (1 − Σ_{i=1}^N ψ_i²) / (1 − Σ_{k=1}^N ψ_k²) = 2.


6.28 Note that, using the indicator variables, Σ_{k=1}^n I_{ki} = 1 for all i,

    x_{ki} = M_i / Σ_{j=1}^N I_{kj} M_j,

    P(Z_i = 1 | I_{11}, ..., I_{nN}) = Σ_{k=1}^n I_{ki} M_i / Σ_{j=1}^N I_{kj} M_j = Σ_{k=1}^n I_{ki} x_{ki},

and

    t̂_RHC = Σ_{k=1}^n t_{α(k)}/x_{k,α(k)} = Σ_{k=1}^n Σ_{i=1}^N I_{ki} Z_i t_i/x_{ki},

where α(k) denotes the psu selected from group k. We show that t̂_RHC is conditionally unbiased for t given the grouping:

E[t̂_RHC | I_{11}, ..., I_{nN}] = E[ Σ_{k=1}^n Σ_{i=1}^N I_{ki} Z_i t_i/x_{ki} | I_{11}, ..., I_{nN} ]
  = Σ_{k=1}^n Σ_{i=1}^N I_{ki} (t_i/x_{ki}) x_{ki}
  = Σ_{k=1}^n Σ_{i=1}^N I_{ki} t_i
  = Σ_{i=1}^N t_i = t.

Since E[t̂_RHC | I_{11}, ..., I_{nN}] = t for any random grouping of psu's, we have that E[t̂_RHC] = t.

To find the variance, note that

    V[t̂_RHC] = E[V(t̂_RHC | I_{11}, ..., I_{nN})] + V[E(t̂_RHC | I_{11}, ..., I_{nN})].

Since E[t̂_RHC | I_{11}, ..., I_{nN}] = t, however, we know that V[E(t̂_RHC | I_{11}, ..., I_{nN})] = 0. Conditionally on the grouping, the k-th term in t̂_RHC estimates the total of group k using an unequal-probability sample of size one. We can thus use (6.4) within each group to find the conditional variance, noting that psu's in different groups are selected independently. (We can obtain the same result by using the indicator variables directly, but it's messier.) Then

V(t̂_RHC | I_{11}, ..., I_{nN}) = Σ_{k=1}^n Σ_{i=1}^N I_{ki} x_{ki} (t_i/x_{ki} − Σ_{j=1}^N I_{kj} t_j)²
  = Σ_{k=1}^n Σ_{i=1}^N I_{ki} t_i²/x_{ki} − Σ_{k=1}^n Σ_{i=1}^N Σ_{j=1}^N I_{ki} t_i I_{kj} t_j
  = Σ_{k=1}^n Σ_{i=1}^N Σ_{j=1}^N I_{ki} I_{kj} ( (M_j/M_i) t_i² − t_it_j ).

Now to find E[V(t̂_RHC | I_{11}, ..., I_{nN})], we need E[I_{ki}] and E[I_{ki}I_{kj}] for i ≠ j. Let N_k be the number of psu's in group k. Then

    E[I_{ki}] = P{psu i in group k} = N_k/N

and, for i ≠ j,

    E[I_{ki}I_{kj}] = P{psu's i and j in group k} = (N_k/N)((N_k − 1)/(N − 1)).

Thus, letting ψ_i = M_i / Σ_{j=1}^N M_j,

V[t̂_RHC] = E[V(t̂_RHC | I_{11}, ..., I_{nN})]
  = E[ Σ_{k=1}^n Σ_{i=1}^N Σ_{j=1}^N I_{ki}I_{kj} ((M_j/M_i) t_i² − t_it_j) ]
  = Σ_{k=1}^n Σ_{i=1}^N (N_k/N)(t_i² − t_i²) + Σ_{k=1}^n (N_k/N)((N_k−1)/(N−1)) Σ_{i=1}^N Σ_{j≠i} ((M_j/M_i) t_i² − t_it_j)
  = ( Σ_{k=1}^n N_k(N_k−1)/(N(N−1)) ) ( Σ_{i=1}^N t_i²/ψ_i − t² )
  = ( Σ_{k=1}^n N_k(N_k−1)/(N(N−1)) ) Σ_{i=1}^N ψ_i (t_i/ψ_i − t)².

The second factor equals nV(t̂_ψ), with V(t̂_ψ) given in (6.46), assuming one-stage cluster sampling.

What should N_1, ..., N_n be in order to minimize V[t̂_RHC]? Note that

    Σ_{k=1}^n N_k(N_k − 1) = Σ_{k=1}^n N_k² − N

is smallest when all N_k's are equal. If N/n = L is an integer, take N_k = L for k = 1, 2, ..., n. With this design,

    V[t̂_RHC] = ((L−1)/(N−1)) Σ_{i=1}^N ψ_i (t_i/ψ_i − t)² = ((N−n)/(N−1)) V(t̂_ψ).

6.29 (a)

Σ_{k≠i} π̃_{ik} = Σ_{k≠i} π_iπ_k [1 − (1−π_i)(1−π_k)/Σ_{j=1}^N c_j]
  = Σ_{k≠i} π_iπ_k − (π_i(1−π_i)/Σ_{j=1}^N c_j) Σ_{k≠i} π_k(1−π_k)
  = π_i(n − π_i) − (π_i(1−π_i)/Σ_{j=1}^N c_j) [ Σ_{k=1}^N π_k(1−π_k) − π_i(1−π_i) ]
  = π_i(n − π_i) − π_i(1−π_i) + π_i²(1−π_i)²/Σ_{j=1}^N c_j
  = π_i(n − 1) + π_i²(1−π_i)²/Σ_{j=1}^N c_j.

(b) If an SRS is taken, π_i = n/N, so

    Σ_{j=1}^N c_j = Σ_{j=1}^N (n/N)(1 − n/N) = n(1 − n/N)

and

π̃_{ik} = (n²/N²)[1 − (1 − n/N)(1 − n/N)/(n(1 − n/N))]
  = (n²/N²)[1 − (1 − n/N)/n]
  = (n/N²)[n − 1 + n/N]
  = (n/N³)[(n−1)N + n]
  = (n/N)((n−1)/(N−1))[1 + (N−n)/((n−1)N²)],

where the last step follows because ((n−1)N + n)(N−1) = (n−1)N² + N − n.

(c) First note that

    π_iπ_k − π̃_{ik} = π_iπ_k (1−π_i)(1−π_k)/Σ_{j=1}^N c_j.

Then, letting B = Σ_{j=1}^N c_j,

V_Haj(t̂_HT) = (1/2) Σ_{i=1}^N Σ_{k=1}^N (π_iπ_k − π̃_{ik})(t_i/π_i − t_k/π_k)²
  = (1/(2B)) Σ_{i=1}^N Σ_{k=1}^N π_iπ_k(1−π_i)(1−π_k)(t_i/π_i − t_k/π_k)²
  = (1/(2B)) Σ_{i=1}^N Σ_{k=1}^N π_iπ_k(1−π_i)(1−π_k)[t_i²/π_i² + t_k²/π_k² − 2 t_it_k/(π_iπ_k)]
  = (1/B) Σ_{i=1}^N Σ_{k=1}^N π_iπ_k(1−π_i)(1−π_k)[t_i²/π_i² − t_it_k/(π_iπ_k)]
  = Σ_{i=1}^N π_i(1−π_i) t_i²/π_i² − (1/B) ( Σ_{i=1}^N π_i(1−π_i) t_i/π_i )²
  = Σ_{i=1}^N c_i t_i²/π_i² − (1/B) ( Σ_{i=1}^N c_i t_i/π_i )²
  = Σ_{i=1}^N c_i (t_i/π_i − A)²,

where A = Σ_{k=1}^N c_k(t_k/π_k) / Σ_{j=1}^N c_j.
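The SRS special case in 6.29(b) can be confirmed numerically. The sketch below (illustrative values n = 5, N = 40, chosen by me) checks that the direct definition of π̃_{ik} agrees with the simplified form:

```python
# Numerical check of the SRS form of pi~_ik in Exercise 6.29(b).
n, N = 5, 40
pi = n / N
sum_c = n * (1 - n / N)                 # sum of c_j = pi_j(1 - pi_j) under SRS

# Direct definition: pi~_ik = pi_i pi_k [1 - (1-pi_i)(1-pi_k)/sum_c]
direct = pi * pi * (1 - (1 - pi) ** 2 / sum_c)

# Simplified form derived in part (b)
simplified = (n / N) * ((n - 1) / (N - 1)) * (1 + (N - n) / ((n - 1) * N ** 2))

print(abs(direct - simplified) < 1e-12)  # True
```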

6.30 (a) From (6.21),

V(t̂_HT) = (1/2) Σ_{i=1}^N Σ_{k≠i} (π_iπ_k − π_{ik})(t_i/π_i − t_k/π_k)²
  = (1/2) Σ_{i=1}^N Σ_{k≠i} (π_iπ_k − π_{ik})(t_i/π_i − t/n + t/n − t_k/π_k)²
  = (1/2) Σ_{i=1}^N Σ_{k≠i} (π_iπ_k − π_{ik}) { (t_i/π_i − t/n)² + (t_k/π_k − t/n)² − 2(t_i/π_i − t/n)(t_k/π_k − t/n) }
  = Σ_{i=1}^N Σ_{k≠i} (π_iπ_k − π_{ik}) { (t_i/π_i − t/n)² − (t_i/π_i − t/n)(t_k/π_k − t/n) }.

From Theorem 6.1, we know that

    Σ_{k=1}^N π_k = n   and   Σ_{k≠i} π_{ik} = (n − 1)π_i,

so

Σ_{i=1}^N Σ_{k≠i} (π_iπ_k − π_{ik})(t_i/π_i − t/n)² = Σ_{i=1}^N [π_i(n − π_i) − (n − 1)π_i](t_i/π_i − t/n)²
  = Σ_{i=1}^N π_i(1 − π_i)(t_i/π_i − t/n)².

This gives the first two terms in (6.47); the third term is the cross-product term above.

(b) For an SRS, π_i = n/N and π_{ik} = [n(n − 1)]/[N(N − 1)]. The first term is

    Σ_{i=1}^N (n/N)((N/n)t_i − t/n)² = (N/n) Σ_{i=1}^N (t_i − t̄_U)² = N(N − 1)S_t²/n.

The second term is

    Σ_{i=1}^N (n/N)²((N/n)t_i − t/n)² = n(N − 1)S_t²/n = (N − 1)S_t².

(c) Substituting π_iπ_k(c_i + c_k)/2 for π_{ik}, the third term in (6.47) is

Σ_{i=1}^N Σ_{k≠i} (π_{ik} − π_iπ_k)(t_i/π_i − t/n)(t_k/π_k − t/n)
  = Σ_{i=1}^N Σ_{k≠i} π_iπ_k ((c_i + c_k − 2)/2)(t_i/π_i − t/n)(t_k/π_k − t/n)
  = Σ_{i=1}^N Σ_{k=1}^N π_iπ_k ((c_i + c_k − 2)/2)(t_i/π_i − t/n)(t_k/π_k − t/n) − Σ_{i=1}^N π_i²(c_i − 1)(t_i/π_i − t/n)²
  = Σ_{i=1}^N π_i²(1 − c_i)(t_i/π_i − t/n)²,

where the full double sum vanishes because Σ_{k=1}^N π_k(t_k/π_k − t/n) = t − t = 0. Then, from (6.47),

V(t̂_HT) = Σ_{i=1}^N π_i(t_i/π_i − t/n)² − Σ_{i=1}^N π_i²(t_i/π_i − t/n)²
      + Σ_{i=1}^N Σ_{k≠i} (π_{ik} − π_iπ_k)(t_i/π_i − t/n)(t_k/π_k − t/n)
  ≈ Σ_{i=1}^N π_i(t_i/π_i − t/n)² − Σ_{i=1}^N π_i²(t_i/π_i − t/n)² + Σ_{i=1}^N π_i²(1 − c_i)(t_i/π_i − t/n)²
  = Σ_{i=1}^N π_i(1 − c_iπ_i)(t_i/π_i − t/n)².

If c_i = (n−1)/(n−π_i), then the variance approximation in (6.48) for an SRS is

Σ_{i=1}^N π_i(1 − c_iπ_i)(t_i/π_i − t/n)² = (1 − (n−1)/(N−1)) (N/n) Σ_{i=1}^N (t_i − t̄_U)²
  = (N(N−1)/n)(1 − (n−1)/(N−1)) S_t².

If

    c_i = (n−1) / (1 − 2π_i + (1/n) Σ_{k=1}^N π_k²),

then

    Σ_{i=1}^N π_i(1 − c_iπ_i)(t_i/π_i − t/n)² = (N(N−1)/n)(1 − n(n−1)/(N−n)) S_t².

6.31 We wish to minimize

    (1/n) Σ_{i=1}^N ψ_i (t_i/ψ_i − t)² + (1/n) Σ_{i=1}^N M_i²S_i²/(m_iψ_i)

subject to the constraint that

    C = E[ Σ_{i∈S} m_i ] = E[ Σ_{i=1}^N Q_i m_i ] = n Σ_{i=1}^N ψ_i m_i.

Using Lagrange multipliers, let

    g(m_1, ..., m_N, λ) = Σ_{i=1}^N M_i²S_i²/(m_iψ_i) − λ (C − n Σ_{i=1}^N ψ_i m_i).

Then

    ∂g/∂m_k = −M_k²S_k²/(m_k²ψ_k) + nλψ_k,
    ∂g/∂λ = n Σ_{i=1}^N ψ_i m_i − C.

Setting the partial derivatives equal to zero gives

    m_k = (1/√(nλ)) M_kS_k/ψ_k   and   √λ = (√n/C) Σ_{i=1}^N M_iS_i.

Thus, the optimal allocation has m_i ∝ M_iS_i/ψ_i. For comparison, a self-weighting design would have m_i ∝ M_i/ψ_i.

6.32 Let M_i be the number of residential numbers in psu i. When you dial a number based on the method,

P(reach working number in psu i on an attempt)
  = P(select psu i) P(get working number | select psu i) = (1/N)(M_i/100).

Also,

    P(reach no one on an attempt) = 1 − Σ_{i=1}^N (1/N)(M_i/100) = 1 − M_0/(100N).

Then,

P(select psu i as first in sample)
  = P(select psu i on first attempt)
    + P(reach no one on first attempt, select psu i on second attempt)
    + P(reach no one on first and second attempts, select psu i on third attempt)
    + ...
  = (1/N)(M_i/100) + (1/N)(M_i/100)(1 − M_0/(100N)) + (1/N)(M_i/100)(1 − M_0/(100N))² + ...
  = (1/N)(M_i/100) Σ_{j=0}^∞ (1 − M_0/(100N))^j
  = (1/N)(M_i/100) · 1/(1 − (1 − M_0/(100N)))
  = M_i/M_0.
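The geometric-series argument can also be confirmed by simulation. A minimal sketch (the M_i counts below are hypothetical, not from the exercise):

```python
import random

# Simulate the redialing scheme of Exercise 6.32: pick a psu at random,
# then a number 0..99 within it; the attempt succeeds if the number is
# residential. Repeat until an attempt succeeds, and record which psu was
# first selected. M below are illustrative counts, not from the exercise.
random.seed(1)
M = [40, 10, 25, 5]                    # residential numbers per psu (of 100)
M0 = sum(M)
N = len(M)

def first_selected_psu():
    while True:
        i = random.randrange(N)            # choose a psu at random
        if random.randrange(100) < M[i]:   # reached a residential number
            return i

trials = 200_000
hits = sum(first_selected_psu() == 0 for _ in range(trials))
# The empirical frequency should be close to M_1/M_0 = 40/80 = 0.5,
# not 1/N = 0.25, matching the geometric-series result.
print(abs(hits / trials - M[0] / M0) < 0.01)   # True
```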


Chapter 7

Complex Surveys

7.6 Here is SAS code for solving this problem. Note that for the population, we have ȳ_U = 17.73,

    S² = [ Σ_{i=1}^{2000} y_i² − (1/2000)(Σ_{i=1}^{2000} y_i)² ] / 1999 = [704958 − 35460²/2000]/1999 = 38.1451726,

θ̂_0.25 = 13.098684, θ̂_0.50 = 16.302326, θ̂_0.75 = 19.847458.

data integerwt;
  infile integer delimiter="," firstobs=2;
  input stratum y;
  ysq = y*y;
run;

/* Calculate the population characteristics for comparison */
proc means data=integerwt mean var;
  var y;
run;

proc surveymeans data=integerwt mean sum percentile = (25 50 75);
  var y ysq;
  /* Without a weight statement, SAS assumes all weights are 1 */
run;

proc glm data=integerwt;
  class stratum;
  model y = stratum;
  means stratum;
run;

/* Before selecting the sample, you need to sort the data set by stratum */
proc sort data=integerwt;
  by stratum;

proc surveyselect data=integerwt method=srs sampsize = (50 50 20 25)
    out = stratsamp seed = 38572 stats;
  strata stratum;
run;

proc print data = stratsamp;
run;

data strattot;
  input stratum _total_;
  datalines;
1 200
2 800
3 400
4 600
;

proc surveymeans data=stratsamp total = strattot mean clm sum
    percentile = (25 50 75);
  strata stratum;
  weight SamplingWeight;
  var y ysq;
run;

/* Create a pseudo-population using the weights */
data pseudopop;
  set stratsamp;
  retain stratum y;
  do i = 1 to SamplingWeight;
    output;
  end;

proc means data=pseudopop mean var;
  var y;
run;

proc surveymeans data=pseudopop mean sum percentile = (25 50 75);
  var y ysq;
run;

The estimates from the last two surveymeans statements are the same (the standard errors, however, are not).

7.7 Let y = number of species caught.

    y    fˆ(y)     y    fˆ(y)
    1    .0328    10    .2295
    3    .0328    11    .0491
    4    .0820    12    .0820
    5    .0328    13    .0328
    6    .0656    14    .0164
    7    .0656    16    .0328
    8    .1803    17    .0164
    9    .0328    18    .0164

Here is SAS code for constructing this table:

data nybight;
  infile nybight delimiter=',' firstobs=2;
  input year stratum catchnum catchwt numspp depth temp;
  select (stratum);
    when (1,2) relwt=1;
    when (3,4,5,6) relwt=2;
  end;
  if year = 1974;

/* Construct empirical probability mass function and empirical cdf. */
proc freq data=nybight;
  tables numspp / out = htpop_epmf outcum;
  weight relwt;

/* SAS proc freq gives values in percents, so we divide each by 100 */
data htpop_epmf;
  set htpop_epmf;
  epmf = percent/100;
  ecdf = cum_pct/100;

proc print data=htpop_epmf;
run;

7.8 We first construct a new variable, weight, with the following values:

    Stratum   weight
    large     (245/23)(M_i/m_i)
    sm/me     (66/8)(M_i/m_i)

Because there is nonresponse on the variable hrwork, for this exercise we take m_i to be the number of respondents in that school. The weights for each teacher sampled in a school are given in the following table:

    dist    school  popteach  m_i   weight
    sm/me      1        2      1     16.50000
    sm/me      2        6      4     12.37500
    sm/me      3       18      7     21.21429
    sm/me      4       12      7     14.14286
    sm/me      6       24     11     18.00000
    sm/me      7       17      4     35.06250
    sm/me      8       19      5     31.35000
    sm/me      9       28     21     11.00000
    large     11       33     10     35.15217
    large     12       16     13     13.11037
    large     13       22      3     78.11594
    large     15       24     24     10.65217
    large     16       27     24     11.98370
    large     18       18      2     95.86957
    large     19       16      3     56.81159
    large     20       12      8     15.97826
    large     21       19      5     40.47826
    large     22       33     13     27.04013
    large     23       31     16     20.63859
    large     24       30      9     35.50725
    large     25       23      8     30.62500
    large     28       53     17     33.20972
    large     29       50      8     66.57609
    large     30       26     22     12.58893
    large     31       25     18     14.79469
    large     32       23     16     15.31250
    large     33       21      5     44.73913
    large     34       33      7     50.21739
    large     36       25      4     66.57609
    large     38       38     10     40.47826
    large     41       30      2    159.78261

The epmf is given below, with y = hrwork.

    y       fˆ(y)     y       fˆ(y)
    20.00   0.0040    34.55   0.0019
    26.25   0.0274    34.60   0.0127
    26.65   0.0367    35.00   0.1056
    27.05   0.0225    35.40   0.0243
    27.50   0.0192    35.85   0.0164
    27.90   0.0125    36.20   0.0022
    28.30   0.0050    36.25   0.0421
    29.15   0.0177    36.65   0.0664
    30.00   0.0375    37.05   0.0023
    30.40   0.0359    37.10   0.0403
    30.80   0.0031    37.50   0.1307
    31.25   0.0662    37.90   0.0079
    32.05   0.0022    37.95   0.0019
    32.10   0.0031    38.35   0.0163
    32.50   0.0370    38.75   0.0084
    32.90   0.0347    39.15   0.0152
    33.30   0.0031    40.00   0.0130
    33.35   0.0152    40.85   0.0018
    33.75   0.0404    41.65   0.0031
    34.15   0.0622    52.50   0.0020
7.10 Without weights

156

CHAPTER 7. COMPLEX SURVEYS With weights

Using the weights makes a huge difference, since the counties with large numbers of veterans also have small weights. 7.13 The variable agefirst contains information on the age at first arrest. Missing values are coded as 99; for this exercise, we use the non-missing cases. Estimated Quantity Mean Median 25th Percentile 75th Percentile

Without Weights 13.07 13 12 15

With Weights 13.04 13 12 15

Calculating these quantities in SAS is easy: simply include the weight variable in PROC UNIVARIATE. The weights change the estimates very little, largely because the survey was designed to be self-weighting. 7.14 Quantity Age ≤ 14 Violent offense Both parents Male Hispanic Single parent Illegal drugs

Variable age crimtype livewith sex ethnicity livewith everdrug

pˆ .1233 .4433 .2974 .9312 .1888 .5411 .8282

7.15 (a) We use the following SAS code to obtain yˆ ¯ = 18.03, with 95% CI [17.48, 18.58].

157 data nhanes; infile nhanes delimiter=’,’ firstobs=2; input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2 dmdeduc indfminc bmxwt bmxbmi bmxtri bmxwaist bmxthicr bmxarml; label age = "Age at Examination (years)" riagendr = "Gender" ridreth2 = "Race/Ethnicity" dmdeduc = "Education Level" indfminc = "Family income" bmxwt = "Weight (kg)" bmxbmi = "Body mass index" bmxtri = "Triceps skinfold (mm)" bmxwaist = "Waist circumference (cm)" bmxthicr = "Thigh circumference (cm)" bmxarml = "Upper arm length (cm)"; run; proc surveymeans data=nhanes mean clm percentile = (0 25 50 75 100); stratum sdmvstra; cluster sdmvpsu; weight wtmec2yr; var bmxtri age; run; (b) The data appear skewed.

(c) The SAS code in part (a) also gives the following.

158 Percentile Minimum 25 50 75 Maximum

CHAPTER 7. COMPLEX SURVEYS Value 2.8 10.98 16.35 23.95 44.6

Std Error 0.177 0.324 0.425

Men: Percentile Minimum 25 50 75 Maximum

Value 2.80 9.19 12.92 18.11 42.4

Women: Percentile Minimum 25 50 75 Maximum

Value 4.00 14.92 21.94 28.36 44.6

(d) Here is SAS code for constructing the plots: data groupage; set nhanes; bmigroup = round(bmxbmi,5); trigroup = round(bmxtri,5); run; proc sort data=groupage; by bmigroup trigroup; proc means data=groupage; by bmigroup trigroup; var wtmec2yr; output out=circleage sum=sumwts; goptions reset=all; goptions colors = (black); axis3 label=(’Body Mass Index, rounded to 5’) order=(10 to 70 by 10); axis4 label=(angle=90 ’Triceps skinfold, rounded to 5’) order=(0 to 55 by 10); /* This gives the weighted circle plot */

159

proc gplot data=circleage; bubble trigroup * bmigroup= sumwts/ bsize=12 haxis = axis3 vaxis = axis4; run; /* The following draws the bubble plot with trend line */ ods graphics on; proc loess data=nhanes; model bmxtri=bmxbmi / degree = 1 select=gcv; weight wtmec2yr; ods output OutputStatistics = bmxsmooth ; run; ods graphics off; proc print data=bmxsmooth; run; proc sort data=bmxsmooth; by bmxbmi; goptions reset=all; goptions colors = (gray); axis4 label=(angle=90 ’Triceps skinfold’) order = (0 to 55 by 10); axis3 label=(’Body Mass Index’) order=(10 to 70 by 10); axis5 order=(0 to 55 by 5) major=none minor=none value=none; symbol interpol=join width=2 color = black; /* Display the trend line with the bubble plot */ data plotsmth; set bubbleage bmxsmooth; /* concatenates the data sets */ run; proc gplot data=plotsmth; bubble bmxtri*bmxbmi = sumwts/ bsize=10 haxis = axis3 vaxis = axis4; plot2 Pred*bmxbmi/haxis = axis3 vaxis = axis5; run;

160

CHAPTER 7. COMPLEX SURVEYS

7.17 We define new variables that take on the value 1 if the person has been a victim of at least one violent crime and 0 otherwise, and another variable for injury. The SAS code and output follows. data ncvs; infile ncvs delimiter = ","; input age married sex race hispanic hhinc away employ numinc violent injury medtreat medexp robbery assault pweight pstrat ppsu; if violent > 0 then isviol = 1; else isviol = 0; if injury > 0 then isinjure = 1;

161 else isinjure = 0; run; proc surveymeans data=ncvs; weight pweight; strata pstrat; cluster ppsu; var numinc isviol isinjure; run; proc surveymeans data=ncvs; weight pweight; strata pstrat; cluster ppsu; var medexp; domain isinjure; run; The SURVEYMEANS Procedure Data Summary Number Number Number Sum of

of Strata of Clusters of Observations Weights

143 286 79360 226204704

Statistics

Variable numinc isviol isinjure

N 79360 79360 79360

Mean 0.070071 0.013634 0.003754

Std Error of Mean 0.002034 0.000665 0.000316

95% CL for Mean 0.06605010 0.07409164 0.01232006 0.01494718 0.00312960 0.00437754

Domain Analysis: isinjure

isinjure 0 1

Variable N medexp 79093 medexp 267

Mean 0 101.6229

Std Error of Mean 0 33.34777

95% CL for Mean 0.0000000 0.000000 35.7046182 167.541160


7.18 Note that $\sum_{q_1 \le y \le q_2} y f(y)$ is the sum of the middle $N(1-2\alpha)$ observations in the population divided by $N$, and $\sum_{q_1 \le y \le q_2} f(y) = F(q_2) - F(q_1) \approx 1 - 2\alpha$. Consequently,
$$\bar{y}_{U\alpha} = \frac{\text{sum of middle } N(1-2\alpha) \text{ observations in the population}}{N(1-2\alpha)}.$$
To estimate the trimmed mean, substitute $\hat{f}$, $\hat{q}_1$, and $\hat{q}_2$ for $f$, $q_1$, and $q_2$.

7.21 As stated in Section 7.1, the $y_i$'s are the measurements on observation units. If unit $i$ is in stratum $h$, then $w_i = N_h/n_h$. To express this formally, let
$$x_{hi} = \begin{cases} 1 & \text{if unit } i \text{ is in stratum } h \\ 0 & \text{otherwise.} \end{cases}$$
Then we can write
$$w_i = \sum_{h=1}^H \frac{N_h}{n_h} x_{hi}$$
and
$$\sum_y y \hat{f}(y) = \frac{\displaystyle\sum_{i\in S} w_i y_i}{\displaystyle\sum_{i\in S} w_i}
= \frac{\displaystyle\sum_{i\in S}\sum_{h=1}^H (N_h/n_h) x_{hi} y_i}{\displaystyle\sum_{i\in S}\sum_{h=1}^H (N_h/n_h) x_{hi}}
= \frac{\displaystyle\sum_{h=1}^H N_h \sum_{i\in S} x_{hi} y_i/n_h}{\displaystyle\sum_{h=1}^H N_h \sum_{i\in S} x_{hi}/n_h}
= \frac{\displaystyle\sum_{h=1}^H N_h \bar{y}_h}{\displaystyle\sum_{h=1}^H N_h}
= \sum_{h=1}^H \frac{N_h}{N}\bar{y}_h.$$

7.22 For an SRS, $w_i = N/n$ for all $i$ and
$$\hat{f}(y) = \sum_{i\in S: y_i = y}\frac{N}{n}\bigg/\sum_{i\in S}\frac{N}{n}.$$
Thus,
$$\sum_y y^2\hat{f}(y) = \sum_{i\in S}\frac{y_i^2}{n},\qquad \sum_y y\hat{f}(y) = \sum_{i\in S}\frac{y_i}{n} = \bar{y},$$
and
$$\hat{S}^2 = \frac{N}{N-1}\left\{\sum_y y^2\hat{f}(y) - \Big[\sum_y y\hat{f}(y)\Big]^2\right\}
= \frac{N}{N-1}\left\{\sum_{i\in S}\frac{y_i^2}{n} - \bar{y}^2\right\}
= \frac{N}{N-1}\sum_{i\in S}\frac{(y_i-\bar{y})^2}{n}
= \frac{N}{N-1}\,\frac{n-1}{n}\,s^2.$$
If $n < N$, $\hat{S}^2$ is smaller than $s^2$ (although they will be close if $n$ is large).

7.23 We need to show that the inclusion probability is the same for every unit in $S_2$. Let $Z_i = 1$ if $i \in S$ and 0 otherwise, and let $D_i = 1$ if $i \in S_2$ and 0 otherwise. We have $P(Z_i = 1) = \pi_i$ and $P(D_i = 1 \mid Z_i = 1) \propto 1/\pi_i$. Then
$$P(i\in S_2) = P(Z_i = 1, D_i = 1) = P(D_i = 1 \mid Z_i = 1)P(Z_i = 1) \propto \frac{1}{\pi_i}\,\pi_i = 1.$$

7.24 A rare disease affects only a few children in the population. Even if all cases belong to the same cluster, a disease with estimated incidence of 2.1 per 1,000 is unlikely to affect all children in that cluster.

7.25 (a) Inner-city areas are sampled at twice the rate of non-inner-city areas. Thus the selection probability for a household not in the inner city is one-half the selection probability for a household in the inner city. The relative weight for a non-inner-city household, then, is 2.

(b) Let $\pi$ represent the probability that a household in the inner city is selected. Then, for 1-person inner-city households,
$$P(\text{person selected}\mid\text{household selected})\,P(\text{household selected}) = 1\times\pi.$$
For $k$-person inner-city households,
$$P(\text{person selected}\mid\text{household selected})\,P(\text{household selected}) = \frac{1}{k}\,\pi.$$

Thus the relative weight for a person in an inner-city household is the number of adults in the household. The relative weight for a person in a non-inner-city household is 2 × (number of adults in household).


The table of relative weights is:

Number of adults   Inner city   Non-inner city
       1               1              2
       2               2              4
       3               3              6
       4               4              8
       5               5             10
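A quick numeric check of these relative weights (a sketch in Python, outside the manual's SAS workflow; the inner-city household selection probability $\pi$ is set to 1 since only ratios of weights matter):

```python
# Relative weight of a sampled adult = 1 / P(adult selected).
# Inner-city households are sampled at rate pi; others at pi/2.
# Within a household of k adults, one adult is chosen with probability 1/k.
def relative_weight(k_adults, inner_city, pi=1.0):
    p_household = pi if inner_city else pi / 2
    p_person = p_household / k_adults
    return 1 / p_person  # relative to a 1-adult inner-city household

table = {(k, ic): relative_weight(k, ic)
         for k in range(1, 6) for ic in (True, False)}
```

The dictionary reproduces the table above: k for inner-city households and 2k for non-inner-city households.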

Chapter 8

Nonresponse

8.1 (a) Oversampling the low-income families is a form of substitution. One advantage of substitution is that the number of low-income families in the sample is larger. The main drawback, however, is that the low-income families that respond may differ from those that do not respond. For example, mothers who work outside the home may be less likely to breast feed and less likely to respond to the survey.

(b) The difference between percentage of mothers with one child indicates that the weighting does not completely adjust for the nonresponse.

(c) Weights were used to try to adjust for nonresponse in this survey. We can never know whether the adjustment is successful, however, unless we have some data from the nonrespondents. The response rate for the survey decreased from 54% in 1984 to 46% in 1989. It might have been better for the survey researchers to concentrate on increasing the response rate and obtaining accurate responses instead of tripling the sample size. Because the survey was poststratified using ethnic background, age, and education, the weighted counts must agree with census figures for those variables. A possible additional variable to use for poststratification would be number of children.

8.2 (a) The respondents report a total of
$$\sum y_i = (66)(32) + (58)(41) + (26)(54) = 5894$$
hours of TV, with
$$\sum y_i^2 = [65(15)^2 + 66(32)^2] + [57(19)^2 + 58(41)^2] + [25(25)^2 + 26(54)^2] = 291725.$$
Then, for the respondents,
$$\bar{y} = \frac{5894}{150} = 39.3,\qquad s^2 = \frac{291725 - (150)(39.3)^2}{149} = 403.6,$$
and
$$SE(\bar{y}) = \sqrt{\left(1 - \frac{150}{2000}\right)\frac{403.6}{150}} = 1.58.$$

Note that this is technically a ratio estimate, since the number of respondents (here, 150) would vary if a different sample were taken. We are estimating the average hours of TV watched in the domain of respondents.

(b)
GPA Group     Respondents   Nonrespondents   Total
3.00–4.00          66              9            75
2.00–2.99          58             14            72
Below 2.00         26             27            53
Total             150             50           200

$$X^2 = \sum_{\text{cells}}\frac{(\text{observed}-\text{expected})^2}{\text{expected}}
= \frac{[66-(.75)(75)]^2}{(.75)(75)} + \frac{[9-(.25)(75)]^2}{(.25)(75)} + \cdots + \frac{[27-(.25)(53)]^2}{(.25)(53)}
= 1.69 + 5.07 + 0.30 + 0.89 + 4.76 + 14.27 = 26.97.$$
Comparing the test statistic to a $\chi^2$ distribution with 2 df, the p-value is $1.4\times10^{-6}$. This is strong evidence against the null hypothesis that the three groups have the same response rates. The hypothesis test indicates that the nonresponse is not MCAR, because response rates appear to be related to GPA. We do not know whether the nonresponse is MAR, or whether it is nonignorable.

(c)
$$\mathrm{SSB} = \sum_{i=1}^3\sum_{j=1}^{n_i}(\bar{y}_i - \bar{y})^2 = 9303.1$$

$$\mathrm{MSW} = s^2 = 403.6.$$
The ANOVA table is as follows:

Source               df     SS        MS       F     p-value
Between groups        2    9303.1   4651.5   11.5    0.0002
Within groups       147   59323.0    403.6
Total, about mean   149   68626.1
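The $\chi^2$ statistic in part (b) can be verified numerically; a quick check in Python (outside the manual's SAS workflow):

```python
# Chi-square test of homogeneity: response status vs. GPA group.
observed = [[66, 9], [58, 14], [26, 27]]  # rows: GPA groups; cols: resp, nonresp

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# X^2 = sum over cells of (observed - expected)^2 / expected
x2 = sum((observed[i][j] - row_totals[i] * col_totals[j] / total) ** 2
         / (row_totals[i] * col_totals[j] / total)
         for i in range(3) for j in range(2))

print(round(x2, 2))  # → 26.97, on 2 df
```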

Both the nonresponse rate and the TV viewing appear to be related to GPA, so it would be a reasonable variable to consider for weighting class adjustment or poststratification.

(d) The initial weight for each person in the sample is 2000/200 = 10. After increasing the weights for the respondents in each class to adjust for the nonrespondents, the weight for each respondent with GPA ≥ 3 is
$$\text{initial weight}\times\frac{\text{sum of weights for sample}}{\text{sum of weights for respondents}} = 10\times\frac{75(10)}{66(10)} = 11.36364.$$

Sample   Respondents   Weight per
Size        (nR)       Respondent (w)    ȳ     w·nR    w·nR·ȳ
  75         66          11.36364        32      750    24000
  72         58          12.41379        41      720    29520
  53         26          20.38462        54      530    28620
 200        150                                 2000    82140

Then $\hat{t}_{wc} = 82140$ and $\bar{y}_{wc} = 82140/2000 = 41.07$. The weighting class adjustment leads to a higher estimate of average viewing time, because the GPA group with the highest TV viewing also has the most nonresponse.

(e) The poststratified weight for each respondent with GPA ≥ 3 is
$$w_{\text{post}} = \text{initial weight}\times\frac{\text{population count}}{\text{sum of respondent weights}} = 10\times\frac{700}{(10)(66)} = 10.60606.$$
Here, nR denotes number of respondents.

         Population
 nR        Count       wpost        ȳ     wpost·nR   wpost·nR·ȳ
  66        700        10.60606     32       700        22400
  58        800        13.79310     41       800        32800
  26        500        19.23077     54       500        27000
 150       2000                              2000        82200

The last column is calculated to check the weights constructed: the sum of the poststratified weights in each poststratum should equal the population count for that poststratum. Then $\hat{t}_{\text{post}} = 82200$ and
$$\hat{\bar{y}}_{\text{post}} = \frac{82200}{2000} = 41.10.$$

8.6 (a) For this exercise, we classify the missing data in the "Other/Unknown" category. Typically, raking would be used in situations in which the classification variables were known (and known to be accurate) for all respondents.
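The weighting-class and poststratification estimates of 8.2(d) and (e) can likewise be reproduced numerically; a sketch in Python rather than SAS:

```python
# Three GPA classes: sample sizes, respondent counts, respondent means.
n_sample = [75, 72, 53]        # sampled per class (initial weight 10 each)
n_resp   = [66, 58, 26]
ybar     = [32, 41, 54]
N, w0 = 2000, 10               # population size; initial weight 2000/200

# (d) Weighting-class adjustment: inflate weights so each class's
# respondents carry the full sampled weight of their class.
w_wc = [w0 * n_sample[i] / n_resp[i] for i in range(3)]
ybar_wc = sum(w_wc[i] * n_resp[i] * ybar[i] for i in range(3)) / N

# (e) Poststratification to known population class counts.
pop_counts = [700, 800, 500]
w_post = [pop_counts[i] / n_resp[i] for i in range(3)]
t_post = sum(w_post[i] * n_resp[i] * ybar[i] for i in range(3))
```

Here `ybar_wc` reproduces 41.07 and `t_post` reproduces the poststratified total 82200.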


                 Population   Respondents   Response Rate (%)
Ph.D.               10235        3036             30
Master's             7071        1640             23
Other/Unknown        1303         325             25
Industry             5397        1809             34
Academia             6327        2221             35
Government           2047         880             43
Other/Unknown        4838          91              1.9

These response rates are pretty dismal. The nonresponse does not appear to be MCAR, as it differs by degree and by type of employment. I doubt that it is MAR—I think that more information than is known from this survey would be needed to predict the nonresponse.

(b) The cell counts from the sample are:

           Industry   Academia   Other   Total
PhD           798       1787      451     3036
non-PhD      1011        434      520     1965
Total        1809       2221      971     5001

The initial sums of weights for each cell are:

           Industry   Academia    Other     Total
PhD         2969.4     6649.5    1678.2   11297.1
non-PhD     3762.0     1614.9    1934.9    7311.9
Total       6731.4     8264.5    3613.1   18609.0

After adjusting for the population row counts (10235 for Ph.D. and 8374 for non-Ph.D.) the new table is:

           Industry   Academia    Other    Total
PhD         2690.2     6024.4    1520.4    10235
non-PhD     4308.5     1849.5    2216.0     8374
Total       6998.7     7873.9    3736.4    18609

Raking to the population column totals (Industry, 5397; Academia, 6327; Other, 6885) gives:

           Industry   Academia    Other     Total
PhD         2074.6     4840.8    2801.6    9717.0
non-PhD     3322.4     1486.2    4083.4    8892.0
Total       5397.0     6327.0    6885.0   18609.0

As you can see, the previous two tables are still far apart. After iterating, the final table of the weight sums is:

           Industry   Academia    Other     Total
PhD         2239.2     4980.6    3015.2   10235.0
non-PhD     3157.8     1346.4    3869.8    8374.0
Total       5397.0     6327.0    6885.0   18609.0

The raking has dramatically increased the weights in the "Other" employment category. To calculate the proportions using the raking weights, create a new variable weight. For respondents with PhD's who work in industry, weight = 2239.2/798 = 2.806. For the question, "Should the ASA develop some sort of certification?" the estimated percentages are:

              Without Weights   With Raking Weights
No response         0.2                 0.3
Yes                26.4                25.8
Possibly           22.3                22.3
No opinion          5.4                 5.4
Unlikely            6.7                 6.9
No                 39.0                39.3
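The iterative proportional fitting behind this raking can be sketched in Python (a generic IPF loop, not the manual's SAS code). Started from the initial weight-sum table, its first row pass reproduces the row-adjusted table above, and it converges to the final table:

```python
# Raking (iterative proportional fitting) of the 2x3 weight-sum table
# to known row totals (degree) and column totals (employment type).
rows = [[2969.4, 6649.5, 1678.2],   # PhD
        [3762.0, 1614.9, 1934.9]]   # non-PhD
row_targets = [10235.0, 8374.0]
col_targets = [5397.0, 6327.0, 6885.0]

for _ in range(100):
    # scale each row to match its row target
    for i in range(2):
        f = row_targets[i] / sum(rows[i])
        rows[i] = [x * f for x in rows[i]]
    # scale each column to match its column target
    for j in range(3):
        f = col_targets[j] / (rows[0][j] + rows[1][j])
        rows[0][j] *= f
        rows[1][j] *= f
```

IPF preserves the odds ratios of the starting table while matching both sets of margins, which is why the limit agrees with the manual's final table.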

(c) I think such a conclusion is questionable because of the very high nonresponse rate. This survey is closer to a self-selected opinion poll than to a probability sample.

8.7
Discipline           Response Rate (%)   Female Members (%)
Literature                 69.5                  38
Classics                   71.2                  27
Philosophy                 73.1                  18
History                    71.5                  19
Linguistics                73.9                  36
Political Science          69.0                  13
Sociology                  71.4                  26

The model implicitly adopted in Example 4.3 was that nonrespondents within each stratum were similar to respondents in that stratum. We can use a χ2 test to examine whether the nonresponse rate varies among strata. The observed counts are given in the following table, with expected counts in parentheses:


                     Respondent      Nonrespondent    Total
Literature           636 (651.6)     279 (263.4)        915
Classics             451 (450.8)     182 (182.2)        633
Philosophy           481 (468.6)     177 (189.4)        658
History              611 (608.9)     244 (246.1)        855
Linguistics          493 (475.0)     174 (192.0)        667
Political Science    575 (593.2)     258 (239.8)        833
Sociology            588 (586.8)     236 (237.2)        824
Total                3835            1550              5385

The Pearson test statistic is
$$X^2 = \sum_{\text{cells}}\frac{(\text{observed}-\text{expected})^2}{\text{expected}} = 6.8.$$

Comparing the test statistic to a $\chi^2_6$ distribution, we calculate p-value 0.34. There is no evidence that the response rates differ among strata.

The estimated correlation coefficient of the response rate and the percent female members is 0.19. Performing a hypothesis test for association (Pearson correlation, Spearman correlation, or Kendall's τ) gives p-value > .10. There is no evidence that the response rate is associated with the percentage of members who are female.

8.12 (a) The overall response rate, using the file teachmi.dat, was 310/754 = 0.41.

(b) As with many nonresponse problems, it's easy to think of plausible reasons why the nonresponse bias might go either direction. The teachers who work many hours may be working so hard they are less likely to return the survey, or they may be more conscientious and thus more likely to return it.

(c) The means and variances from the file teachnr.dat (ignoring missing values) are

Variable    responses      ȳ          s²       V̂(ȳ)
hrwork         26        36.46       2.61       0.10
size           25        24.92      25.74       1.03
preprmin       26       160.19    3436.96     132.19
assist         26       152.31   49314.46    1896.71

The corresponding estimates from teachers.dat, the original cluster sample, are:

Variable     ŷ̄r      V̂(ŷ̄r)
hrwork      33.82      0.50
size        26.93      0.57
preprmin   168.74     70.57
assist      52.00    228.96

8.14 (a) We are more likely to delete an observation if the value of $x_i$ is small. Since $x_i$ and $y_i$ are positively correlated, we expect the mean of $y$ to be too big.

(b) The population mean of acres92 is $\bar{y}_U = 308582$.

8.16 We use the approximations from Chapter 3 to obtain:
$$E[\hat{\bar{y}}_R] = E\left[\frac{\displaystyle\sum_{i=1}^N Z_iR_iw_iy_i}{\displaystyle\sum_{i=1}^N Z_iR_iw_i}\left(1 - \frac{\sum_{i=1}^N(Z_iR_iw_i - \phi_i)}{\sum_{i=1}^N\phi_i}\right)\right]
\approx \frac{\displaystyle\sum_{i=1}^N\phi_iy_i}{\displaystyle\sum_{i=1}^N\phi_i}
= \frac{1}{N\bar{\phi}_U}\sum_{i=1}^N\phi_iy_i.$$
Thus the bias is
$$\mathrm{Bias}[\hat{\bar{y}}_R] \approx \frac{1}{N\bar{\phi}_U}\sum_{i=1}^N(\phi_i - \bar{\phi}_U)y_i
= \frac{1}{N\bar{\phi}_U}\sum_{i=1}^N(\phi_i - \bar{\phi}_U)(y_i - \bar{y}_U)
\approx \frac{1}{\bar{\phi}_U}\mathrm{Cov}(\phi_i, y_i).$$

8.17 The argument is similar to the previous exercise. If the classes are sufficiently large, then $E[1/\tilde{\phi}_c] \approx 1/\bar{\phi}_c$.

8.19
$$V(\hat{\bar{y}}_{wc}) = V\left[\frac{n_1}{n}\frac{1}{n_{1R}}\sum_{i=1}^N Z_iR_ix_iy_i + \frac{n_2}{n}\frac{1}{n_{2R}}\sum_{i=1}^N Z_iR_i(1-x_i)y_i\right]$$
$$= E\left\{V\left[\frac{n_1}{n}\frac{\sum_i Z_iR_ix_iy_i}{\sum_i Z_iR_ix_i} + \frac{n_2}{n}\frac{\sum_i Z_iR_i(1-x_i)y_i}{\sum_i Z_iR_i(1-x_i)}\,\bigg|\,Z_1,\dots,Z_N\right]\right\}
+ V\left\{E\left[\frac{n_1}{n}\frac{\sum_i Z_iR_ix_iy_i}{\sum_i Z_iR_ix_i} + \frac{n_2}{n}\frac{\sum_i Z_iR_i(1-x_i)y_i}{\sum_i Z_iR_i(1-x_i)}\,\bigg|\,Z_1,\dots,Z_N\right]\right\}.$$
We use the ratio approximations from Chapter 4 to find the approximate expected values and variances (here $n_1 = \sum_i Z_ix_i$ and $n_{1R} = \sum_i Z_iR_ix_i$):
$$E\left[\frac{n_1}{n}\frac{\sum_i Z_iR_ix_iy_i}{\sum_i Z_iR_ix_i}\,\bigg|\,Z_1,\dots,Z_N\right]
\approx \frac{1}{n}\sum_{i=1}^N Z_ix_iy_i - \frac{1}{(n\phi_1)^2}\sum_{i=1}^N Z_iV(R_i)x_iy_i
\approx \frac{1}{n}\sum_{i=1}^N Z_ix_iy_i.$$
Consequently,
$$V\left\{E\left[\cdots\,\big|\,Z_1,\dots,Z_N\right]\right\}
\approx V\left\{\frac{1}{n}\sum_{i=1}^N Z_ix_iy_i + \frac{1}{n}\sum_{i=1}^N Z_i(1-x_i)y_i\right\}
= \left(1-\frac{n}{N}\right)\frac{S_y^2}{n},$$
the variance that would be obtained if there were no nonresponse. For the other term,
$$V\left[\frac{n_1}{n}\frac{\sum_i Z_iR_ix_iy_i}{\sum_i Z_iR_ix_i}\,\bigg|\,Z_1,\dots,Z_N\right]
\approx \frac{1}{(n\phi_1)^2}\sum_{i=1}^N Z_iV(R_i)x_iy_i^2
= \frac{\phi_1(1-\phi_1)}{(n\phi_1)^2}\sum_{i=1}^N Z_ix_iy_i^2.$$
Thus,
$$E\left\{V\left[\cdots\,\big|\,Z_1,\dots,Z_N\right]\right\}
\approx E\left\{\frac{\phi_1(1-\phi_1)}{(n\phi_1)^2}\sum_{i=1}^N Z_ix_iy_i^2 + \frac{\phi_2(1-\phi_2)}{(n\phi_2)^2}\sum_{i=1}^N Z_i(1-x_i)y_i^2\right\}
= \frac{\phi_1(1-\phi_1)}{n\phi_1^2}\sum_{i=1}^N x_iy_i^2 + \frac{\phi_2(1-\phi_2)}{n\phi_2^2}\sum_{i=1}^N(1-x_i)y_i^2.$$

8.20 (a) Respondents are divided into 5 classes on the basis of the number of nights the respondent was home during the 4 nights preceding the survey call. The sampling weight wi for respondent i is then multiplied by 5/(ki + 1). The respondents with k = 0 were only home on one of the five nights and are assigned to represent their share of the population plus the share of four persons in the sample who were called on one of their “unavailable” nights. The respondents most likely to be home have k = 4; it is presumed that all persons in the sample who were home every night were reached, so their weights are unchanged. (b) This method of weighting is based on the premise that the most accessible persons will tend to be overrepresented in the survey data. The method is easy to use, theoretically appealing, and can be used in conjunction with callbacks. But it still misses people who were not at home on any of the five nights, or who refused to participate in the survey. Since in many surveys done over the telephone, nonresponse is due in large part to refusals, the HPS method may not be helpful in dealing with all nonresponse. Values of k may also be in error, because people may err when recalling how many evenings they were home.


Chapter 9

Variance Estimation in Complex Surveys

9.1 All of the methods discussed in this chapter would be appropriate. Note that the replication methods might slightly overestimate the variance because sampling is done without replacement, but since the sampling fractions are fairly small we expect the overestimation to be small.

9.2 We calculate $\bar{y} = 8.23333333$ and $s^2 = 15.978$, so $s^2/30 = 0.5326$. For jackknife replicate $j$, the jackknife weight is $w_{j(j)} = 0$ for observation $j$ and $w_{i(j)} = (30/29)w_i = (30/29)(100/30) = 3.44828$ for $i \ne j$. Using the jackknife weights, we find $\bar{y}_{(1)} = 8.2413,\ \dots,\ \bar{y}_{(30)} = 8.20690$, so, by (9.8),
$$\hat{V}_{JK}(\bar{y}) = \frac{29}{30}\sum_{j=1}^{30}[\bar{y}_{(j)} - \bar{y}]^2 = 0.5326054.$$
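As a check on this jackknife arithmetic (a Python sketch outside the manual's SAS workflow, using the 30 observations of the srs30 data set from Exercise 9.3, whose mean and variance match the values quoted here): for the sample mean, the delete-one jackknife reproduces $s^2/n$ exactly.

```python
# Delete-one jackknife variance estimate for a sample mean (srs30 data).
y = [8, 5, 2, 6, 6, 3, 8, 7, 10, 14, 3, 4, 17, 10, 6,
     10, 7, 15, 6, 14, 12, 7, 9, 15, 8, 12, 3, 9, 5, 6]
n = len(y)
ybar = sum(y) / n                                   # 8.2333...

# j-th jackknife replicate: mean with observation j deleted
reps = [(sum(y) - yj) / (n - 1) for yj in y]
v_jk = (n - 1) / n * sum((r - ybar) ** 2 for r in reps)

# For the mean, v_jk collapses algebraically to s^2/n
s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)      # 15.978
```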

9.3 Here is the empirical cdf $\hat{F}(y)$:

Obs    y    COUNT     PERCENT    CUM_FREQ   CUM_PCT     epmf      ecdf
 1     2    3.3333     3.3333      3.333      3.333    0.03333   0.03333
 2     3   10.0000    10.0000     13.333     13.333    0.10000   0.13333
 3     4    3.3333     3.3333     16.667     16.667    0.03333   0.16667
 4     5    6.6667     6.6667     23.333     23.333    0.06667   0.23333
 5     6   16.6667    16.6667     40.000     40.000    0.16667   0.40000
 6     7   10.0000    10.0000     50.000     50.000    0.10000   0.50000
 7     8   10.0000    10.0000     60.000     60.000    0.10000   0.60000
 8     9    6.6667     6.6667     66.667     66.667    0.06667   0.66667
 9    10   10.0000    10.0000     76.667     76.667    0.10000   0.76667
10    12    6.6667     6.6667     83.333     83.333    0.06667   0.83333
11    14    6.6667     6.6667     90.000     90.000    0.06667   0.90000
12    15    6.6667     6.6667     96.667     96.667    0.06667   0.96667
13    17    3.3333     3.3333    100.000    100.000    0.03333   1.00000

Note that $\hat{F}(7) = .5$, so the median is $\hat{\theta}_{0.5} = 7$. No interpolation is needed. As in Example 9.12, $\hat{F}(\theta_{1/2})$ is the sample proportion of observations that take on value at most $\theta_{1/2}$, so
$$\hat{V}[\hat{F}(\hat{\theta}_{1/2})] = \left(1-\frac{n}{N}\right)\frac{0.25862069}{n} = \left(1-\frac{30}{100}\right)\frac{0.25862069}{30} = 0.006034483.$$
This is a small sample, so we use the $t_{29}$ critical value of 2.045 to calculate
$$2.045\sqrt{\hat{V}[\hat{F}(\hat{\theta}_{1/2})]} = 0.1588596.$$
The lower confidence bound is $\hat{F}^{-1}(.5 - 0.1588596) = \hat{F}^{-1}(0.3411404)$ and the upper confidence bound for the median is $\hat{F}^{-1}(.5 + 0.1588596) = \hat{F}^{-1}(0.6588596)$. Interpolating, we have that the lower confidence bound is
$$5 + \frac{0.34114 - 0.23333}{0.4 - 0.23333}(6-5) = 5.6$$
and the upper confidence bound is
$$8 + \frac{0.6588596 - 0.6}{0.666667 - 0.6}(9-8) = 8.8.$$
Thus an approximate 95% CI is [5.6, 8.8]. SAS code below gives approximately the same interval:

data srs30;
  input y @@;
  wt = 100/30;
datalines;
8 5 2 6 6 3 8 7 10 14 3 4 17 10 6
10 7 15 6 14 12 7 9 15 8 12 3 9 5 6
;

/* We use two methods. First, the "hand" calculations */

/* Find the empirical cdf */
proc freq data=srs30;
  tables y / out=htpop_epmf outcum;
  weight wt;
run;

data htpop_epmf;
  set htpop_epmf;
  epmf = percent/100;
  ecdf = cum_pct/100;
run;

proc print data=htpop_epmf; run;

/* Find the variance of \hat{F}(median) */
data calcvar;
  set srs30;
  ui = 0;
  if y le 7 then ui = 1;
  ei = ui - .5;

proc univariate data=calcvar;
  var ei;
run;

/* Calculate the stratified variance for the total of variable ei */
proc surveymeans data=calcvar total=100 sum stderr;
  weight wt;
  var ei;
run;

/* Method 2: Use SAS directly to find the CI */
proc surveymeans data=srs30 total=100 percentile=(25 50 75) nonsymcl;
  weight wt;
  var y;
run;

Quantiles
Variable   Percentile    Estimate    Std Error      95% Confidence Limits
y          25% Q1        5.100000    0.770164    2.65604792    5.8063712
           50% Median    7.000000    0.791213    5.64673564    8.8831609
           75% Q3        9.833333    1.057332    7.16875624   11.4937313
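The same "hand" calculation can be sketched in Python, outside the SAS workflow shown above (the $t_{29}$ critical value 2.045 is taken from the text):

```python
from math import sqrt

# srs30 data: SRS of n = 30 from a population of N = 100
y = [8, 5, 2, 6, 6, 3, 8, 7, 10, 14, 3, 4, 17, 10, 6,
     10, 7, 15, 6, 14, 12, 7, 9, 15, 8, 12, 3, 9, 5, 6]
n, N = len(y), 100

# u_i indicates y_i <= median (7); its sample variance gives 0.25862069
u = [1 if v <= 7 else 0 for v in y]
ubar = sum(u) / n
s2_u = sum((ui - ubar) ** 2 for ui in u) / (n - 1)
vhat = (1 - n / N) * s2_u / n          # V-hat[F-hat(median)]

half = 2.045 * sqrt(vhat)              # half-width on the F scale
p_lo, p_hi = .5 - half, .5 + half      # invert F-hat at these for the CI
```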


9.5

          (a) Age   (b) Violent  (c) Bothpar  (d) Male   (e) Hispanic  (f) Sinpar  (g) Drugs
θ̂1       0.12447     0.52179      0.29016     0.90160      0.30106      0.55691     0.90072
θ̂2       0.09528     0.43358      0.31309     0.84929      0.20751      0.52381     0.84265
θ̂3       0.08202     0.36733      0.34417     0.99319      0.17876      0.51068     0.82960
θ̂4       0.21562     0.37370      0.25465     0.96096      0.08532      0.55352     0.80869
θ̂5       0.21660     0.42893      0.30181     0.91314      0.14912      0.54480     0.74491
θ̂6       0.07321     0.48006      0.30514     0.96786      0.15752      0.55350     0.82232
θ̂7       0.02402     0.51201      0.27299     0.96558      0.25170      0.54490     0.84977

θ̂        0.12330     0.44325      0.29743     0.93119      0.18877      0.54108     0.82821
θ̃        0.11875     0.44534      0.29743     0.93594      0.19014      0.54116     0.82838
V̂1(θ̂)   0.00076     0.00055      0.00012     0.00036      0.00072      0.00004     0.00032
V̂2(θ̂)   0.00076     0.00055      0.00012     0.00036      0.00072      0.00004     0.00032

9.6 From Exercise 3.4, $\hat{B} = 11.41946$, $\hat{\bar{y}}_r = \bar{x}_U\hat{B} = (10.3)\hat{B} = 117.6$, and $SE(\hat{\bar{y}}_r) = 3.98$. Using the jackknife, we have $\hat{B}_{(\cdot)} = 11.41937$, $\hat{\bar{y}}_{r(\cdot)} = 117.6$, and $SE(\hat{\bar{y}}_r) = 10.3\sqrt{0.1836} = 4.41$. The jackknife standard error is larger, partly because it does not include the fpc.

9.7 We use
$$\hat{V}_{JK}(\hat{\bar{y}}_r) = \frac{n-1}{n}\sum_{j=1}^{10}(\hat{\bar{y}}_{r(j)} - \hat{\bar{y}}_r)^2.$$

The $\hat{\bar{y}}_{r(j)}$'s for returnf and hadmeas are given in the following table:

School, j   returnf, ŷ̄r(j)   hadmeas, ŷ̄r(j)
    1         0.5822185          0.4253903
    2         0.5860165          0.4647582
    3         0.5504290          0.4109223
    4         0.5768984          0.4214941
    5         0.5950112          0.4275614
    6         0.5829014          0.4615285
    7         0.5726580          0.4379689
    8         0.5785320          0.4313120
    9         0.5650470          0.4951728
   10         0.5986785          0.4304341

For returnf,
$$\hat{V}_{JK}(\hat{\bar{y}}_r) = \frac{9}{10}\sum_{j=1}^{10}(\hat{\bar{y}}_{r(j)} - 0.5789482)^2 = 0.00160.$$
For hadmeas,
$$\hat{V}_{JK}(\hat{\bar{y}}_r) = \frac{9}{10}\sum_{j=1}^{10}(\hat{\bar{y}}_{r(j)} - 0.4402907)^2 = 0.00526.$$
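Combining the replicate estimates is a one-line computation; a quick Python check using the $\hat{\bar{y}}_{r(j)}$ values tabulated above:

```python
# Jackknife variance from 10 replicate ratio estimates
# (clusters/schools deleted one at a time).
def v_jackknife(reps, full_estimate):
    n = len(reps)
    return (n - 1) / n * sum((r - full_estimate) ** 2 for r in reps)

returnf = [0.5822185, 0.5860165, 0.5504290, 0.5768984, 0.5950112,
           0.5829014, 0.5726580, 0.5785320, 0.5650470, 0.5986785]
hadmeas = [0.4253903, 0.4647582, 0.4109223, 0.4214941, 0.4275614,
           0.4615285, 0.4379689, 0.4313120, 0.4951728, 0.4304341]

v_returnf = v_jackknife(returnf, 0.5789482)  # about 0.00160
v_hadmeas = v_jackknife(hadmeas, 0.4402907)  # about 0.00526
```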

9.8 We have $\hat{B}_{(\cdot)} = 0.9865651$ and $\hat{V}_{JK}(\hat{B}) = 3.707\times10^{-5}$. With the fpc, the linearization variance estimate is $\hat{V}_L(\hat{B}) = 3.071\times10^{-5}$; the linearization variance estimate if we ignore the fpc is
$$3.071\times10^{-5}\bigg/\sqrt{1-\frac{300}{3078}} = 3.232\times10^{-5}.$$

9.9 The median weekday greens fee for nine holes is $\hat{\theta} = 12$. For the SRS of size 120,
$$V[\hat{F}(\theta_{0.5})] = \frac{(.5)(.5)}{120} = 0.0021.$$
An approximate 95% confidence interval for the median is therefore
$$[\hat{F}^{-1}(.5 - 1.96\sqrt{.0021}),\ \hat{F}^{-1}(.5 + 1.96\sqrt{.0021})] = [\hat{F}^{-1}(.4105),\ \hat{F}^{-1}(.5895)].$$
We have the following values for the empirical distribution function:

y        10.25    10.8     11     11.5     12      13      14      15      16
F̂(y)    .3917   .4000   .4167   .4333   .5167   .5250   .5417   .5833   .6000

Interpolating,
$$\hat{F}^{-1}(.4105) = 10.8 + \frac{.4105 - .4}{.4167 - .4}(11 - 10.8) = 10.9$$
and
$$\hat{F}^{-1}(.5895) = 15 + \frac{.5895 - .5833}{.6 - .5833}(16 - 15) = 15.4.$$
Thus, an approximate 95% confidence interval for the median is [10.9, 15.4].

Note: If we apply the bootstrap to these data, we get
$$\frac{1}{1000}\sum_{r=1}^{1000}\hat{\theta}_r^* = 12.86$$
with standard error 1.39. This leads to a 95% CI of [10.1, 15.6] for the median.

9.13 (a) Since $h''(t) = -2t$, the remainder term is
$$\int_a^x(x-t)h''(t)\,dt = -2\int_a^x(x-t)\,dt = -2\left(x^2 - \frac{x^2}{2} - ax + \frac{a^2}{2}\right) = -(x-a)^2.$$
Thus,
$$h(\hat{p}) = h(p) + h'(p)(\hat{p}-p) - (\hat{p}-p)^2 = p(1-p) + (1-2p)(\hat{p}-p) - (\hat{p}-p)^2.$$

(b) The remainder term is likely to be smaller than the other terms because it has $(\hat{p}-p)^2$ in it. This will be small if $\hat{p}$ is close to $p$.
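The interpolation of $\hat{F}^{-1}$ in 9.9 can be automated; a small Python sketch using the tabulated $\hat{F}$ values above:

```python
# Linear interpolation of the inverse empirical cdf between tabulated points.
ys = [10.25, 10.8, 11, 11.5, 12, 13, 14, 15, 16]
Fs = [.3917, .4000, .4167, .4333, .5167, .5250, .5417, .5833, .6000]

def F_inverse(p):
    # find the bracketing pair (Fs[k], Fs[k+1]) and interpolate linearly
    for k in range(len(Fs) - 1):
        if Fs[k] <= p <= Fs[k + 1]:
            frac = (p - Fs[k]) / (Fs[k + 1] - Fs[k])
            return ys[k] + frac * (ys[k + 1] - ys[k])
    raise ValueError("p outside tabulated range")

lower, upper = F_inverse(.4105), F_inverse(.5895)  # CI endpoints for the median
```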


(c) To find the exact variance, we need to find $V(\hat{p} - \hat{p}^2)$, which involves the fourth moments. For an SRSWR, $X = n\hat{p} \sim \mathrm{Bin}(n, p)$, so we can find the moments using the moment generating function of the Binomial: $M_X(t) = (pe^t + q)^n$. So,
$$E(X) = M_X'(t)\big|_{t=0} = n(pe^t+q)^{n-1}pe^t\big|_{t=0} = np,$$
$$E(X^2) = M_X''(t)\big|_{t=0} = \left[n(n-1)(pe^t+q)^{n-2}(pe^t)^2 + n(pe^t+q)^{n-1}pe^t\right]\big|_{t=0} = n(n-1)p^2 + np = n^2p^2 + np(1-p),$$
$$E(X^3) = M_X'''(t)\big|_{t=0} = np(1 - 3p + 3np + 2p^2 - 3np^2 + n^2p^2),$$
$$E(X^4) = np(1 - 7p + 7np + 12p^2 - 18np^2 + 6n^2p^2 - 6p^3 + 11np^3 - 6n^2p^3 + n^3p^3).$$
Then,
$$V[\hat{p}(1-\hat{p})] = V(\hat{p}) + V(\hat{p}^2) - 2\,\mathrm{Cov}(\hat{p},\hat{p}^2)
= E[\hat{p}^2] - p^2 + E[\hat{p}^4] - [E(\hat{p}^2)]^2 - 2E[\hat{p}^3] + 2pE(\hat{p}^2).$$
Substituting the moments above (with $E[\hat{p}^k] = E[X^k]/n^k$) and collecting powers of $n$ gives
$$V[\hat{p}(1-\hat{p})] = \frac{p(1-p)}{n}(1 - 4p + 4p^2) + \frac{1}{n^2}(-2p + 14p^2 - 22p^3 + 12p^4) + \frac{1}{n^3}(p - 7p^2 + 12p^3 - 6p^4).$$
Note that the first term is $(1-2p)^2V(\hat{p})$, and the other terms are (constant)$/n^2$ and (constant)$/n^3$. The remainder terms become small relative to the first term when $n$ is large. You can see why statisticians use the linearization method so frequently: even for this simple example, the exact calculations of the variance are nasty.

Note that with an SRS without replacement, the result is much more complicated. Results from the following paper may be used to find the moments:

Finucan, H. M., Galbraith, R. F., and Stone, M. (1974). Moments without tears in simple random sampling from a finite population. Biometrika, 61, 151–154.

9.14 (a) Write $B_1 = h(t_{xy}, t_x, t_y, t_{x^2}, N)$, where
$$h(a,b,c,d,e) = \frac{a - bc/e}{d - b^2/e} = \frac{ea - bc}{ed - b^2}.$$
The partial derivatives, evaluated at the population quantities, are:
$$\frac{\partial h}{\partial a} = \frac{e}{ed-b^2}$$
$$\frac{\partial h}{\partial b} = -\frac{c}{ed-b^2} + \frac{2b(ea-bc)}{(ed-b^2)^2} = \frac{e}{ed-b^2}\left[-\frac{c}{e} + 2\frac{b}{e}B_1\right] = -\frac{e}{ed-b^2}(B_0 - B_1\bar{x}_U)$$
$$\frac{\partial h}{\partial c} = -\frac{b}{ed-b^2} = -\frac{e}{ed-b^2}\,\bar{x}_U$$
$$\frac{\partial h}{\partial d} = -\frac{e(ea-bc)}{(ed-b^2)^2} = -\frac{e}{ed-b^2}B_1$$
$$\frac{\partial h}{\partial e} = \frac{a}{ed-b^2} - \frac{d(ea-bc)}{(ed-b^2)^2} = \frac{a - dB_1}{ed-b^2} = \frac{e}{ed-b^2}B_0\bar{x}_U.$$


The last equality follows from the normal equations. Then, by linearization,
$$\hat{B}_1 - B_1 \approx \frac{\partial h}{\partial a}(\hat{t}_{xy} - t_{xy}) + \frac{\partial h}{\partial b}(\hat{t}_x - t_x) + \frac{\partial h}{\partial c}(\hat{t}_y - t_y) + \frac{\partial h}{\partial d}(\hat{t}_{x^2} - t_{x^2}) + \frac{\partial h}{\partial e}(\hat{N} - N)$$
$$= \frac{N}{Nt_{x^2} - (t_x)^2}\left[\hat{t}_{xy} - t_{xy} - (B_0 - B_1\bar{x}_U)(\hat{t}_x - t_x) - \bar{x}_U(\hat{t}_y - t_y) - B_1(\hat{t}_{x^2} - t_{x^2}) + B_0\bar{x}_U(\hat{N} - N)\right]$$
$$= \frac{N}{Nt_{x^2} - (t_x)^2}\left[\sum_{i\in S}w_i\left\{x_iy_i - (B_0 - B_1\bar{x}_U)x_i - \bar{x}_Uy_i - B_1x_i^2 + B_0\bar{x}_U\right\}\right]
- \frac{N}{Nt_{x^2} - (t_x)^2}\left[t_{xy} - t_x(B_0 - B_1\bar{x}_U) - \bar{x}_Ut_y - B_1t_{x^2} + B_0N\bar{x}_U\right]$$
$$= \frac{N}{Nt_{x^2} - (t_x)^2}\sum_{i\in S}w_i(y_i - B_0 - B_1x_i)(x_i - \bar{x}_U).$$

9.15 (a) Write
$$S^2 = \frac{1}{N-1}\sum_{i=1}^N\left(y_i - \frac{1}{N}\sum_{j=1}^N y_j\right)^2 = \frac{1}{t_3-1}\left(t_1 - \frac{t_2^2}{t_3}\right),$$
where $t_1 = \sum_i y_i^2$, $t_2 = \sum_i y_i$, and $t_3 = N$.

(b) Substituting, we have
$$\hat{S}^2 = \frac{1}{\hat{t}_3 - 1}\left(\hat{t}_1 - \frac{\hat{t}_2^2}{\hat{t}_3}\right).$$

(c) We need to find the partial derivatives:
$$\frac{\partial h}{\partial t_1} = \frac{1}{t_3-1},\qquad
\frac{\partial h}{\partial t_2} = -2\frac{t_2}{t_3(t_3-1)},\qquad
\frac{\partial h}{\partial t_3} = -\frac{1}{(t_3-1)^2}\left(t_1 - \frac{t_2^2}{t_3}\right) + \frac{1}{t_3-1}\frac{t_2^2}{t_3^2}.$$
Then, by linearization,
$$\hat{S}^2 - S^2 \approx \frac{\partial h}{\partial t_1}(\hat{t}_1 - t_1) + \frac{\partial h}{\partial t_2}(\hat{t}_2 - t_2) + \frac{\partial h}{\partial t_3}(\hat{t}_3 - t_3).$$
Let
$$q_i = \frac{\partial h}{\partial t_1}y_i^2 + \frac{\partial h}{\partial t_2}y_i + \frac{\partial h}{\partial t_3}
= \frac{1}{t_3-1}y_i^2 - \frac{2t_2}{t_3(t_3-1)}y_i - \frac{1}{(t_3-1)^2}\left(t_1 - \frac{t_2^2}{t_3}\right) + \frac{1}{t_3-1}\frac{t_2^2}{t_3^2}.$$

9.16 (a) Write $R = h(t_1,\dots,t_6)$, where
$$h(a,b,c,d,e,f) = \frac{d - ab/f}{\sqrt{(c - a^2/f)(e - b^2/f)}} = \frac{fd - ab}{\sqrt{(fc-a^2)(fe-b^2)}}.$$
The partial derivatives, evaluated at the population quantities, are:
$$\frac{\partial h}{\partial a} = \frac{1}{\sqrt{(fc-a^2)(fe-b^2)}}\left(-b + \frac{a(fd-ab)}{fc-a^2}\right) = \frac{-t_y}{N(N-1)S_xS_y} + \frac{t_xR}{N(N-1)S_x^2}$$
$$\frac{\partial h}{\partial b} = \frac{1}{\sqrt{(fc-a^2)(fe-b^2)}}\left(-a + \frac{b(fd-ab)}{fe-b^2}\right) = \frac{-t_x}{N(N-1)S_xS_y} + \frac{t_yR}{N(N-1)S_y^2}$$
$$\frac{\partial h}{\partial c} = -\frac{1}{2}\frac{f(fd-ab)}{(fc-a^2)\sqrt{(fc-a^2)(fe-b^2)}} = -\frac{1}{2}\frac{R}{(N-1)S_x^2}$$
$$\frac{\partial h}{\partial d} = \frac{f}{\sqrt{(fc-a^2)(fe-b^2)}} = \frac{1}{(N-1)S_xS_y}$$
$$\frac{\partial h}{\partial e} = -\frac{1}{2}\frac{f(fd-ab)}{(fe-b^2)\sqrt{(fc-a^2)(fe-b^2)}} = -\frac{1}{2}\frac{R}{(N-1)S_y^2}$$
$$\frac{\partial h}{\partial f} = \frac{d}{\sqrt{(fc-a^2)(fe-b^2)}} - \frac{fd-ab}{2\sqrt{(fc-a^2)(fe-b^2)}}\left(\frac{e}{fe-b^2} + \frac{c}{fc-a^2}\right)
= \frac{t_{xy}}{N(N-1)S_xS_y} - \frac{R}{2}\left(\frac{t_{y^2}}{N(N-1)S_y^2} + \frac{t_{x^2}}{N(N-1)S_x^2}\right).$$


Then, by linearization,
$$\hat{R} - R \approx \frac{\partial h}{\partial a}(\hat{t}_x - t_x) + \frac{\partial h}{\partial b}(\hat{t}_y - t_y) + \frac{\partial h}{\partial c}(\hat{t}_{x^2} - t_{x^2}) + \frac{\partial h}{\partial d}(\hat{t}_{xy} - t_{xy}) + \frac{\partial h}{\partial e}(\hat{t}_{y^2} - t_{y^2}) + \frac{\partial h}{\partial f}(\hat{N} - N)$$
$$= \frac{1}{N(N-1)S_xS_y}\left[\left(-t_y + \frac{t_xRS_y}{S_x}\right)(\hat{t}_x - t_x) + \left(-t_x + \frac{t_yRS_x}{S_y}\right)(\hat{t}_y - t_y)\right]
- \frac{R}{2(N-1)S_x^2}(\hat{t}_{x^2} - t_{x^2}) + \frac{1}{(N-1)S_xS_y}(\hat{t}_{xy} - t_{xy}) - \frac{R}{2(N-1)S_y^2}(\hat{t}_{y^2} - t_{y^2})
+ \left[\frac{t_{xy}}{N(N-1)S_xS_y} - \frac{R}{2}\left(\frac{t_{y^2}}{N(N-1)S_y^2} + \frac{t_{x^2}}{N(N-1)S_x^2}\right)\right](\hat{N} - N).$$
This is somewhat easier to do in matrix terms. With $\hat{\mathbf{t}} = (\hat{t}_x, \hat{t}_y, \hat{t}_{x^2}, \hat{t}_{y^2}, \hat{t}_{xy}, \hat{N})^T$, let
$$\boldsymbol{\delta} = \left[-\bar{y}_U + \frac{\bar{x}_URS_y}{S_x},\ -\bar{x}_U + \frac{\bar{y}_URS_x}{S_y},\ -\frac{RS_y}{2S_x},\ -\frac{RS_x}{2S_y},\ 1,\ \frac{t_{xy}}{N} - \frac{R}{2N}\left(\frac{t_{y^2}S_x}{S_y} + \frac{t_{x^2}S_y}{S_x}\right)\right]^T;$$
then
$$V(\hat{R}) \approx \frac{1}{[(N-1)S_xS_y]^2}\,\boldsymbol{\delta}^T\mathrm{Cov}(\hat{\mathbf{t}})\boldsymbol{\delta}.$$

9.17 Write the function as $h(a_1,\dots,a_L, b_1,\dots,b_L)$. Then
$$\left.\frac{\partial h}{\partial a_l}\right|_{t_1,\dots,t_L,N_1,\dots,N_L} = 1
\qquad\text{and}\qquad
\left.\frac{\partial h}{\partial b_l}\right|_{t_1,\dots,t_L,N_1,\dots,N_L} = -\frac{t_l}{N_l}.$$
Consequently,
$$h(\hat{t}_1,\dots,\hat{t}_L,\hat{N}_1,\dots,\hat{N}_L) \approx t + \sum_{l=1}^L(\hat{t}_l - t_l) - \sum_{l=1}^L\frac{t_l}{N_l}(\hat{N}_l - N_l)$$
and
$$V(\hat{t}_{\text{post}}) \approx V\left[\sum_{l=1}^L\left(\hat{t}_l - \frac{t_l}{N_l}\hat{N}_l\right)\right].$$

9.18 From (9.5),
$$\hat{V}_2(\hat{\theta}) = \frac{1}{R}\frac{1}{R-1}\sum_{r=1}^R(\hat{\theta}_r - \hat{\theta})^2.$$
Without loss of generality, let $\bar{y}_U = 0$. We know that $\bar{y} = \sum_{r=1}^R\bar{y}_r/R$. Suppose the random groups are independent. Then $\bar{y}_1,\dots,\bar{y}_R$ are independent and identically distributed random variables with
$$E[\bar{y}_r] = 0,\qquad V[\bar{y}_r] = E[\bar{y}_r^2] = \frac{S^2}{m} = \kappa_2(\bar{y}_1),\qquad E[\bar{y}_r^4] = \kappa_4(\bar{y}_1).$$
We have
$$E\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar{y}_r - \bar{y})^2\right]
= \frac{1}{R(R-1)}\sum_{r=1}^R[V(\bar{y}_r) - V(\bar{y})]
= \frac{1}{R(R-1)}\sum_{r=1}^R\left[\frac{S^2}{m} - \frac{S^2}{n}\right]
= \frac{1}{R(R-1)}\sum_{r=1}^R\left[R\frac{S^2}{n} - \frac{S^2}{n}\right]
= \frac{S^2}{n}.$$
Also,
$$E\left[\left\{\sum_{r=1}^R(\bar{y}_r - \bar{y})^2\right\}^2\right]
= E\left[\left\{\sum_{r=1}^R\bar{y}_r^2 - R\bar{y}^2\right\}^2\right]
= E\left[\sum_{r=1}^R\sum_{s=1}^R\bar{y}_r^2\bar{y}_s^2 - 2R\bar{y}^2\sum_{r=1}^R\bar{y}_r^2 + R^2\bar{y}^4\right]$$
$$= E\left[\sum_r\sum_s\bar{y}_r^2\bar{y}_s^2 - \frac{2}{R}\sum_j\sum_k\bar{y}_j\bar{y}_k\sum_r\bar{y}_r^2 + \frac{1}{R^2}\sum_j\sum_k\sum_r\sum_s\bar{y}_j\bar{y}_k\bar{y}_r\bar{y}_s\right]$$
$$= \left(1 - \frac{2}{R} + \frac{1}{R^2}\right)R\,\kappa_4(\bar{y}_1) + \left(1 - \frac{2}{R} + \frac{3}{R^2}\right)R(R-1)\,\kappa_2^2(\bar{y}_1).$$
Consequently,
$$E\left[\hat{V}_2^2(\hat{\theta})\right]
= \frac{1}{R^2(R-1)^2}\left[\left(1-\frac{2}{R}+\frac{1}{R^2}\right)R\kappa_4(\bar{y}_1) + \left(1-\frac{2}{R}+\frac{3}{R^2}\right)R(R-1)\kappa_2^2(\bar{y}_1)\right]
= \frac{1}{R^3}\kappa_4(\bar{y}_1) + \frac{R^2-2R+3}{R^3(R-1)}\kappa_2^2(\bar{y}_1)$$
and
$$V\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar{y}_r-\bar{y})^2\right]
= \frac{1}{R^3}\kappa_4(\bar{y}_1) + \frac{R^2-2R+3}{R^3(R-1)}\kappa_2^2(\bar{y}_1) - \left(\frac{S^2}{n}\right)^2,$$
so that
$$\mathrm{CV}^2\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar{y}_r-\bar{y})^2\right]
= \frac{1}{R}\left[\frac{\kappa_4(\bar{y}_1)m^2}{S^4} - \frac{R-3}{R-1}\right].$$
We now need to find $\kappa_4(\bar{y}_1) = E[\bar{y}_r^4]$ to finish the problem. A complete argument giving the fourth moment for an SRSWR is given by Hansen, M. H., Hurwitz, W. N., and Madow, W. G. (1953). Sample Survey Methods and Theory, Volume 2. New York: Wiley, pp. 99–100. They note that
$$\bar{y}_r^4 = \frac{1}{m^4}\left[\sum_{i\in S_r}y_i^4 + 4\sum_{i\ne j}y_i^3y_j + 3\sum_{i\ne j}y_i^2y_j^2 + 6\sum_{i\ne j\ne k}y_i^2y_jy_k + \sum_{i\ne j\ne k\ne l}y_iy_jy_ky_l\right],$$
so that
$$\kappa_4(\bar{y}_1) = E[\bar{y}_r^4] = \frac{1}{m^3(N-1)}\sum_{i=1}^N(y_i - \bar{y}_U)^4 + \frac{3(m-1)}{m^3}S^4.$$
This results in
$$\mathrm{CV}^2\left[\frac{1}{R(R-1)}\sum_{r=1}^R(\bar{y}_r-\bar{y})^2\right]
= \frac{1}{R}\left[\frac{\kappa}{m} + \frac{3(m-1)}{m} - \frac{R-3}{R-1}\right],$$
where $\kappa = \sum_{i=1}^N(y_i-\bar{y}_U)^4/[(N-1)S^4]$. The number of groups, $R$, has more impact on the CV than the group size $m$: the random group estimator of the variance is unstable if $R$ is small.

9.19 First note that
$$\bar{y}_{str}(\alpha_r) - \bar{y}_{str}
= \sum_{h=1}^H\frac{N_h}{N}\bar{y}_h(\alpha_r) - \sum_{h=1}^H\frac{N_h}{N}\frac{y_{h1}+y_{h2}}{2}
= \sum_{h=1}^H\frac{N_h}{N}\left[\frac{\alpha_{rh}+1}{2}y_{h1} - \frac{\alpha_{rh}-1}{2}y_{h2}\right] - \sum_{h=1}^H\frac{N_h}{N}\frac{y_{h1}+y_{h2}}{2}
= \sum_{h=1}^H\frac{N_h}{N}\alpha_{rh}\frac{y_{h1}-y_{h2}}{2}.$$
Then
$$\hat{V}_{BRR}(\bar{y}_{str}) = \frac{1}{R}\sum_{r=1}^R[\bar{y}_{str}(\alpha_r) - \bar{y}_{str}]^2
= \frac{1}{R}\sum_{r=1}^R\left[\sum_{h=1}^H\frac{N_h}{N}\alpha_{rh}\frac{y_{h1}-y_{h2}}{2}\right]^2
= \frac{1}{R}\sum_{r=1}^R\sum_{h=1}^H\sum_{\ell=1}^H\alpha_{rh}\alpha_{r\ell}\frac{N_h}{N}\frac{y_{h1}-y_{h2}}{2}\,\frac{N_\ell}{N}\frac{y_{\ell1}-y_{\ell2}}{2}$$
$$= \frac{1}{R}\sum_{r=1}^R\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\alpha_{rh}^2\frac{(y_{h1}-y_{h2})^2}{4}
+ \frac{1}{R}\sum_{h=1}^H\sum_{\ell\ne h}\frac{N_h}{N}\frac{y_{h1}-y_{h2}}{2}\,\frac{N_\ell}{N}\frac{y_{\ell1}-y_{\ell2}}{2}\sum_{r=1}^R\alpha_{rh}\alpha_{r\ell}
= \sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\frac{(y_{h1}-y_{h2})^2}{4}
= \hat{V}_{str}(\bar{y}_{str}).$$
The last step holds because $\sum_{r=1}^R\alpha_{rh}\alpha_{r\ell} = 0$ for $\ell \ne h$.
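The identity can be checked numerically with a small balanced design. A Python sketch with $H = 3$ strata, two PSUs per stratum, and the 4 balanced half-sample patterns taken from a $4\times4$ Hadamard matrix (the $N_h$ and $y_{hj}$ values here are hypothetical):

```python
# Verify V-hat_BRR = V-hat_str for two-PSUs-per-stratum BRR (hypothetical data).
alphas = [(1, 1, 1), (-1, 1, -1), (1, -1, -1), (-1, -1, 1)]  # balanced: sum_r a_rh a_rl = 0
Nh = [300, 500, 200]
yh = [(4.0, 6.0), (10.0, 7.0), (3.0, 8.0)]   # (y_h1, y_h2) per stratum
N = sum(Nh)

ybar_str = sum(Nh[h] / N * (yh[h][0] + yh[h][1]) / 2 for h in range(3))

def ybar_half(alpha):
    # half-sample estimate: take y_h1 when alpha_h = +1, y_h2 when alpha_h = -1
    return sum(Nh[h] / N * (yh[h][0] if alpha[h] == 1 else yh[h][1])
               for h in range(3))

v_brr = sum((ybar_half(a) - ybar_str) ** 2 for a in alphas) / len(alphas)
v_str = sum((Nh[h] / N) ** 2 * (yh[h][0] - yh[h][1]) ** 2 / 4 for h in range(3))
```

The cross-stratum terms cancel because the columns of `alphas` are mutually orthogonal, so `v_brr` and `v_str` agree to rounding error.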

9.20 As noted in the text,
$$\hat{V}_{str}(\bar{y}_{str}) = \sum_{h=1}^H\left(\frac{N_h}{N}\right)^2\frac{(y_{h1}-y_{h2})^2}{4}.$$
Also,
$$\hat{\theta}(\alpha_r) = \bar{y}_{str}(\alpha_r) = \sum_{h=1}^H\frac{\alpha_{rh}N_h}{2N}(y_{h1}-y_{h2}) + \bar{y}_{str},$$
so
$$\hat{\theta}(\alpha_r) - \hat{\theta}(-\alpha_r) = \sum_{h=1}^H\alpha_{rh}\frac{N_h}{N}(y_{h1}-y_{h2})$$
and, using the property $\sum_{r=1}^R\alpha_{rh}\alpha_{rk} = 0$ for $k \ne h$,
$$\frac{1}{4R}\sum_{r=1}^R[\hat{\theta}(\alpha_r) - \hat{\theta}(-\alpha_r)]^2
= \frac{1}{4R}\sum_{r=1}^R\sum_{h=1}^H\sum_{k=1}^H\frac{N_h}{N}\frac{N_k}{N}\alpha_{rh}\alpha_{rk}(y_{h1}-y_{h2})(y_{k1}-y_{k2})$$
$$= \frac{1}{4R}\sum_{r=1}^R\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1}-y_{h2})^2
= \frac{1}{4}\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1}-y_{h2})^2 = \hat{V}_{str}(\bar{y}_{str}).$$

Similarly,
$$\frac{1}{2R}\sum_{r=1}^R\left\{[\hat{\theta}(\alpha_r) - \hat{\theta}]^2 + [\hat{\theta}(-\alpha_r) - \hat{\theta}]^2\right\}
= \frac{1}{2R}\sum_{r=1}^R\left\{\left[\sum_{h=1}^H\frac{\alpha_{rh}N_h}{2N}(y_{h1}-y_{h2})\right]^2 + \left[\sum_{h=1}^H\frac{-\alpha_{rh}N_h}{2N}(y_{h1}-y_{h2})\right]^2\right\}$$
$$= \frac{1}{2R}\sum_{r=1}^R2\sum_{h=1}^H\alpha_{rh}^2\left(\frac{N_h}{2N}\right)^2(y_{h1}-y_{h2})^2
= \frac{1}{4}\sum_{h=1}^H\left(\frac{N_h}{N}\right)^2(y_{h1}-y_{h2})^2.$$

9.21 Note that
\[
\hat t(\alpha_r)=\sum_{h=1}^{H}N_h\left[\frac{\alpha_{rh}}{2}(y_{h1}-y_{h2})+\bar y_h\right]
=\sum_{h=1}^{H}\frac{N_h\alpha_{rh}}{2}(y_{h1}-y_{h2})+\hat t
\]
and
\[
[\hat t(\alpha_r)]^2=\sum_{h=1}^{H}\sum_{k=1}^{H}\frac{N_hN_k\alpha_{rh}\alpha_{rk}}{4}(y_{h1}-y_{h2})(y_{k1}-y_{k2})
+2\hat t\sum_{h=1}^{H}\frac{N_h\alpha_{rh}}{2}(y_{h1}-y_{h2})+\hat t^{\,2}.
\]
Thus,
\[
\hat t(\alpha_r)-\hat t(-\alpha_r)=\sum_{h=1}^{H}N_h\alpha_{rh}(y_{h1}-y_{h2}),
\]
\[
[\hat t(\alpha_r)]^2-[\hat t(-\alpha_r)]^2=2\hat t\sum_{h=1}^{H}N_h\alpha_{rh}(y_{h1}-y_{h2}),
\]
and
\[
\hat\theta(\alpha_r)-\hat\theta(-\alpha_r)=(2a\hat t+b)\sum_{h=1}^{H}N_h\alpha_{rh}(y_{h1}-y_{h2}).
\]


Consequently, using the balanced property \(\sum_{r=1}^{R}\alpha_{rh}\alpha_{rk}=0\) for \(k\ne h\), we have
\[
\begin{aligned}
\frac{1}{4R}\sum_{r=1}^{R}[\hat\theta(\alpha_r)-\hat\theta(-\alpha_r)]^2
&=\frac{1}{4R}(2a\hat t+b)^2\sum_{r=1}^{R}\sum_{h=1}^{H}\sum_{k=1}^{H}N_hN_k\alpha_{rh}\alpha_{rk}(y_{h1}-y_{h2})(y_{k1}-y_{k2})\\
&=\frac{1}{4}(2a\hat t+b)^2\sum_{h=1}^{H}N_h^2(y_{h1}-y_{h2})^2.
\end{aligned}
\]
Using linearization, \(h(\hat t)\approx h(t)+(2at+b)(\hat t-t)\), so
\[
V_L(\hat\theta)=(2at+b)^2\,V(\hat t)
\]
and
\[
\hat V_L(\hat\theta)=(2a\hat t+b)^2\,\frac{1}{4}\sum_{h=1}^{H}N_h^2(y_{h1}-y_{h2})^2,
\]
which is the same as \(\frac{1}{4R}\sum_{r=1}^{R}[\hat\theta(\alpha_r)-\hat\theta(-\alpha_r)]^2\).
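The agreement between the BRR difference estimator and the linearization form for a quadratic function of the total can be checked numerically with a Hadamard matrix of half-sample indicators; the stratum values and coefficients a, b below are illustrative only:

```python
# For theta = a*t^2 + b*t, check that
#   (1/(4R)) * sum_r [theta(alpha_r) - theta(-alpha_r)]^2
# equals (2*a*that + b)^2 * (1/4) * sum_h N_h^2 * (y_h1 - y_h2)^2.
H4 = [[1, 1, 1, 1],
      [1, -1, 1, -1],
      [1, 1, -1, -1],
      [1, -1, -1, 1]]
N_h = [40, 30, 20, 10]
y1 = [2.0, 5.0, 1.0, 7.0]
y2 = [3.0, 4.0, 6.0, 2.0]
a, b = 0.3, -1.7                      # arbitrary coefficients

def t_half(alpha):
    # estimated total from the half sample selected by alpha
    return sum(N_h[h] * (y1[h] if alpha[h] == 1 else y2[h]) for h in range(4))

def theta(t):
    return a * t * t + b * t

t_hat = sum(N_h[h] * (y1[h] + y2[h]) / 2 for h in range(4))

R = 4
lhs = sum((theta(t_half(row)) - theta(t_half([-x for x in row]))) ** 2
          for row in H4) / (4 * R)
rhs = (2 * a * t_hat + b) ** 2 * sum(
    N_h[h] ** 2 * (y1[h] - y2[h]) ** 2 for h in range(4)) / 4
```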

9.23 We can write
\[
\hat t_{\mathrm{post}}=g(w,y,x_1,\dots,x_L)
=\sum_{l=1}^{L}N_l\,\frac{\displaystyle\sum_{j\in S}w_jx_{lj}y_j}{\displaystyle\sum_{j\in S}w_jx_{lj}}.
\]
Then,
\[
\begin{aligned}
z_i=\frac{\partial g(w,y,x_1,\dots,x_L)}{\partial w_i}
&=\sum_{l=1}^{L}\left[\frac{N_lx_{li}y_i}{\displaystyle\sum_{j\in S}w_jx_{lj}}
-\frac{N_lx_{li}\displaystyle\sum_{j\in S}w_jx_{lj}y_j}{\Bigl(\displaystyle\sum_{j\in S}w_jx_{lj}\Bigr)^{2}}\right]\\
&=\sum_{l=1}^{L}\left\{\frac{N_lx_{li}y_i}{\hat N_l}-\frac{N_lx_{li}\hat t_{yl}}{\hat N_l^{2}}\right\}
=\sum_{l=1}^{L}\frac{N_lx_{li}}{\hat N_l}\left(y_i-\frac{\hat t_{yl}}{\hat N_l}\right).
\end{aligned}
\]
Thus,
\[
\hat V(\hat t_{\mathrm{post}})=\hat V\Bigl(\sum_{i\in S}w_iz_i\Bigr).
\]

Note that this variance estimator differs from the one in Exercise 9.17, although they are asymptotically equivalent.

9.24 From Chapter 5,
\[
\begin{aligned}
V(\hat t)&\approx N^2\,\frac{M\,\mathrm{MSB}}{n}\\
&=\frac{N^2M}{n}\,\frac{NM-1}{M(N-1)}\,S^2[1+(M-1)\mathrm{ICC}]\\
&\approx\frac{(NM)^2}{nM}\,p(1-p)[1+(M-1)\mathrm{ICC}].
\end{aligned}
\]
Consequently, the relative variance \(v=V(\hat t)/t^2\) can be written as \(\beta_0+\beta_1/t\), where
\[
\beta_0=-\frac{1}{nM}[1+(M-1)\mathrm{ICC}]
\qquad\text{and}\qquad
\beta_1=\frac{N}{n}[1+(M-1)\mathrm{ICC}].
\]

9.25 (a) From (9.2),
\[
\begin{aligned}
V[\hat B]&\approx E\left[\left\{-\frac{t_y}{t_x^2}(\hat t_x-t_x)+\frac{1}{t_x}(\hat t_y-t_y)\right\}^2\right]\\
&=\frac{t_y^2}{t_x^2}\,E\left[\left\{-\frac{1}{t_x}(\hat t_x-t_x)+\frac{1}{t_y}(\hat t_y-t_y)\right\}^2\right]\\
&=\frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_x)}{t_x^2}+\frac{V(\hat t_y)}{t_y^2}-\frac{2}{t_xt_y}\,\mathrm{Cov}(\hat t_x,\hat t_y)\right]\\
&=\frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_x)}{t_x^2}+\frac{V(\hat t_y)}{t_y^2}-\frac{2B}{t_xt_y}\,V(\hat t_x)\right]\\
&=\frac{t_y^2}{t_x^2}\left[\frac{V(\hat t_y)}{t_y^2}-\frac{V(\hat t_x)}{t_x^2}\right],
\end{aligned}
\]
using \(\mathrm{Cov}(\hat t_x,\hat t_y)=B\,V(\hat t_x)\). Using the fitted model from (9.13),
\[
\frac{\hat V(\hat t_x)}{\hat t_x^{\,2}}=a+\frac{b}{\hat t_x}
\qquad\text{and}\qquad
\frac{\hat V(\hat t_y)}{\hat t_y^{\,2}}=a+\frac{b}{\hat t_y}.
\]
Consequently, substituting estimators for the population quantities,
\[
\hat V[\hat B]=\hat B^2\left[a+\frac{b}{\hat t_y}-a-\frac{b}{\hat t_x}\right],
\]


which gives the result.

(b) When \(B\) is a proportion,
\[
\hat V[\hat B]=\hat B^2\left[\frac{b}{\hat t_y}-\frac{b}{\hat t_x}\right]
=\hat B^2\,\frac{b}{\hat t_x}\left[\frac{1}{\hat B}-1\right]
=\frac{b\hat B(1-\hat B)}{\hat t_x}.
\]
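The last equality in (b) is pure algebra with \(\hat B=\hat t_y/\hat t_x\); a quick numeric check with arbitrary values:

```python
# Verify that Bhat^2 * (b/ty - b/tx) = b * Bhat * (1 - Bhat) / tx
# whenever Bhat = ty / tx.
def lhs(b, tx, ty):
    B = ty / tx
    return B * B * (b / ty - b / tx)

def rhs(b, tx, ty):
    B = ty / tx
    return b * B * (1 - B) / tx

pairs = [(0.8, 200.0, 50.0), (1.3, 1000.0, 250.0)]   # arbitrary (b, tx, ty)
ok = all(abs(lhs(*p) - rhs(*p)) < 1e-12 for p in pairs)
```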

Chapter 10

Categorical Data Analysis in Complex Surveys

10.1 Many data sets used for chi-square tests in introductory statistics books use dependent data. See Alf and Lohr (2007) for a review of how books ignore clustering in the data.

10.3 (a) Observed and expected (in parentheses) proportions are given in the following table:

                        Abuse
                 No               Yes
Symptom  No   .7542 (.7109)   .1017 (.1451)
         Yes  .0763 (.1196)   .0678 (.0244)

(b)
\[
X^2 = 118\left[\frac{(.7542-.7109)^2}{.7109}+\cdots+\frac{(.0678-.0244)^2}{.0244}\right]=12.8,
\]
\[
G^2 = 2(118)\left[.7542\ln\!\left(\frac{.7542}{.7109}\right)+\cdots+.0678\ln\!\left(\frac{.0678}{.0244}\right)\right]=10.3.
\]
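The arithmetic can be reproduced directly from the four cells of the table; a minimal Python check (the manual itself uses SAS elsewhere, so this is purely illustrative):

```python
from math import log

# Recompute X^2 and G^2 for Exercise 10.3 from the observed and
# expected proportions, with n = 118.
obs = [0.7542, 0.1017, 0.0763, 0.0678]
exp = [0.7109, 0.1451, 0.1196, 0.0244]
n = 118

X2 = n * sum((o - e) ** 2 / e for o, e in zip(obs, exp))       # about 12.8
G2 = 2 * n * sum(o * log(o / e) for o, e in zip(obs, exp))     # about 10.3
```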

Both p-values are less than .002. Because the expected count in the Yes-Yes cell is small, we also perform Fisher's exact test, which gives p-value .0016.

10.4 (a) This is a test of independence. A sample of students is taken, and each student is classified by instructor and grade.

(b) \(X^2 = 34.8\). Comparing this to a \(\chi^2_3\) distribution, we see that the p-value is less than 0.0001. A similar conclusion follows from the likelihood ratio test, with \(G^2 = 34.5\).

(c) Students are probably not independent: most likely, a cluster sample of students was taken, with the Math II classes as the clusters. The p-values in part (b) are thus lower than they should be.

10.5 The following table gives the value of \(\hat\theta\) for the 7 random groups:

Random Group    \(\hat\theta\)
     1           0.0132
     2           0.0147
     3           0.0252
     4          -0.0224
     5           0.0073
     6          -0.0057
     7           0.0135
Average          0.0065
std. dev.        0.0158

Using the random group method, the standard error of \(\hat\theta\) is \(0.0158/\sqrt{7} = 0.0060\), so the test statistic is
\[
\frac{\hat\theta^2}{\hat V(\hat\theta)} = 0.79.
\]
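The summary statistics in the table can be reproduced directly; a minimal Python check (the reported test statistic uses the full-sample \(\hat\theta\), which is not shown here):

```python
from statistics import mean, stdev
from math import sqrt

# Reproduce the summary statistics for the 7 random-group estimates.
theta = [0.0132, 0.0147, 0.0252, -0.0224, 0.0073, -0.0057, 0.0135]

avg = mean(theta)                  # about 0.0065
sd = stdev(theta)                  # about 0.0158
se = sd / sqrt(len(theta))         # about 0.0060
```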

Since our estimate of the variance from the random group method has only 6 df, we compare the test statistic to an F(1, 6) distribution rather than to a \(\chi^2_1\) distribution, obtaining a p-value of 0.4.

10.6 (a) The contingency table (for complete data) is as follows:

                         Break again?
                         No     Yes    Total
Faculty                  65     167      232
Classified staff         55     459      514
Administrative staff     11      75       86
Academic professional     9      58       67
Total                   140     759      899

\(X_p^2 = 37.3\); comparing to a \(\chi^2_3\) distribution gives p-value < .0001. We can use the \(\chi^2\) test for homogeneity because we assume product-multinomial sampling. (Class is the stratification variable.)

(b) Using the weights (with the respondents who answer both questions), we estimate the probabilities as

                      Work
                  No       Yes      Total
Break     No    0.0832   0.0859    0.1691
again?    Yes   0.6496   0.1813    0.8309
Total           0.7328   0.2672    1.0000

To estimate the proportion in the Yes-Yes cell, I used:
\[
\hat p_{yy}=\frac{\text{sum of weights of persons answering yes to both questions}}{\text{sum of weights of respondents to both questions}}.
\]
Other answers are possible, depending on how you want to treat the nonresponse.

(c) The odds ratio, calculated using the table in part (b), is
\[
\frac{0.0832/0.0859}{0.6496/0.1813} = 0.27065.
\]
(Or, you could get 1/.27065 = 3.695.) The estimated proportions ignoring the weights are

                      Work
                  No       Yes      Total
Break     No    0.0850   0.0671    0.1521
again?    Yes   0.6969   0.1510    0.8479
Total           0.7819   0.2181    1.0000

Without weights the odds ratio is
\[
\frac{0.0850/0.0671}{0.6969/0.1510} = 0.27448
\]
(or, 1/.27448 = 3.643). Weights appear to make little difference in the value of the odds ratio.

(d) \(\hat\theta = (.0832)(.1813) - (.6496)(.0859) = -0.04068\).

(e) Using linearization, define
\[
q_i = \hat p_{22}y_{11i} + \hat p_{11}y_{12i} - \hat p_{12}y_{21i} - \hat p_{21}y_{22i},
\]
where \(y_{jki}\) is an indicator variable for membership in class \((j,k)\). We then estimate \(V(\bar q_{\mathrm{str}})\) using the usual methods for stratified samples. Using the summary statistics,

Stratum   \(N_h\)   \(n_h\)   \(\bar q_h\)   \(s_h^2\)   \(\left(1-\frac{n_h}{N_h}\right)\left(\frac{N_h}{N}\right)^2\frac{s_h^2}{n_h}\)
Faculty    1374    228    -.117    0.0792    4.04 x 10^-5
C.S.       1960    514    -.059    0.0111    4.52 x 10^-6
A.S.        252     86    -.061    0.0207    7.42 x 10^-7
A.P.         95     66    -.076    0.0349    1.08 x 10^-7
Total      3681    894                       4.58 x 10^-5

Thus \(\hat V_L(\hat\theta) = 4.58\times10^{-5}\) and
\[
X_W^2 = \frac{\hat\theta^2}{\hat V_L(\hat\theta)} = \frac{0.00165}{4.58\times10^{-5}} = 36.2.
\]
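These computations can be reproduced from the cell proportions and the stratum summary statistics; a Python sketch (numbers taken from the tables above):

```python
# Recompute the weighted odds ratio, theta-hat, and the linearization
# variance for Exercise 10.6 from the reported summary statistics.
p11, p12, p21, p22 = 0.0832, 0.0859, 0.6496, 0.1813

odds_ratio = (p11 / p12) / (p21 / p22)      # about 0.271
theta = p11 * p22 - p21 * p12               # about -0.0407

strata = [  # (N_h, n_h, s_h^2) for Faculty, C.S., A.S., A.P.
    (1374, 228, 0.0792),
    (1960, 514, 0.0111),
    (252, 86, 0.0207),
    (95, 66, 0.0349),
]
N = sum(Nh for Nh, _, _ in strata)
v_L = sum((1 - nh / Nh) * (Nh / N) ** 2 * sh2 / nh
          for Nh, nh, sh2 in strata)        # about 4.58e-5
X2_W = theta ** 2 / v_L                     # Wald statistic, about 36.2
```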

We reject the null hypothesis with p-value < 0.0001.

10.7 Answers will vary, depending on how the categories for zprep are formed.

10.8 (a) Under the null hypothesis of independence, the expected proportions are:

                                Fitness Level
Smoking Status    Recommended   Minimum Acceptable   Unacceptable
Current               .241             .140              .159
Occasional            .020             .011              .013
Never                 .186             .108              .123

Using (10.2) and (10.3),
\[
X^2 = n\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(\hat p_{ij}-\hat p_{i+}\hat p_{+j})^2}{\hat p_{i+}\hat p_{+j}} = 18.25,
\]
\[
G^2 = 2n\sum_{i=1}^{r}\sum_{j=1}^{c}\hat p_{ij}\ln\!\left(\frac{\hat p_{ij}}{\hat p_{i+}\hat p_{+j}}\right) = 18.25.
\]
Comparing each statistic to a \(\chi^2_4\) distribution gives p-value = .001.

(b) Using (10.9), \(E[X^2]\approx E[G^2]\approx 6.84\).

(c)
\[
X_F^2 = G_F^2 = \frac{4X^2}{6.84} = 10.7,
\]
with p-value = .03 (comparing to a \(\chi^2_4\) distribution).
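The first-order Rao-Scott correction in part (c) simply rescales the statistic by df divided by the estimated null expectation; a quick check:

```python
# First-order Rao-Scott correction for Exercise 10.8: divide X^2 (or G^2)
# by the estimated mean generalized design effect, E[X^2] / df.
X2 = 18.25
Ehat = 6.84        # estimated E[X^2] under the null, from (10.9)
df = 4

X2_F = X2 * df / Ehat      # about 10.7
```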

10.9 (a) Under the null hypothesis of independence, the expected proportions are:

                            Males    Females
Decision-making managers    0.076     0.065
Advisor-managers            0.018     0.016
Supervisors                 0.064     0.054
Semi-autonomous workers     0.103     0.087
Workers                     0.279     0.238

Using (10.2) and (10.3),
\[
X^2 = n\sum_{i=1}^{r}\sum_{j=1}^{c}\frac{(\hat p_{ij}-\hat p_{i+}\hat p_{+j})^2}{\hat p_{i+}\hat p_{+j}} = 55.1,
\]
\[
G^2 = 2n\sum_{i=1}^{r}\sum_{j=1}^{c}\hat p_{ij}\ln\!\left(\frac{\hat p_{ij}}{\hat p_{i+}\hat p_{+j}}\right) = 56.6.
\]
Comparing each statistic to a \(\chi^2\) distribution with (2 - 1)(5 - 1) = 4 df gives "p-values" that are less than 1 x 10^-9.

(b) Using (10.9), we have \(E[X^2]\approx E[G^2]\approx 4.45\).

(c) df = number of psus - number of strata = 34.

(d)
\[
X_F^2 = \frac{4X^2}{4.45} = .899X^2 = 49.5,
\qquad
G_F^2 = \frac{4G^2}{4.45} = 50.8.
\]

The p-values for these statistics are still small, less than 1.0 x 10^-9.

(e) The p-value for \(X_s^2\) is 2.6 x 10^-8.

10.11 Here is SAS code and output:

options ovp nocenter ls=85;
filename nhanes 'C:\nhanes.csv';
data nhanes;
  infile nhanes delimiter=',' firstobs=2;
  input sdmvstra sdmvpsu wtmec2yr age ridageyr riagendr ridreth2 dmdeduc
        indfminc bmxwt bmxbmi bmxtri bmxwaist bmxthicr bmxarml;
  bmiclass = .;
  if bmxbmi > 0 and bmxbmi < 25 then bmiclass = 1;
  else if bmxbmi >= 25 and bmxbmi < 30 then bmiclass = 2;
  else if bmxbmi >= 30 then bmiclass = 3;
  if age < 30 then ageclass = 1;
  else if age >= 30 then ageclass = 2;
  label age = "Age at Examination (years)"
        riagendr = "Gender"
        ridreth2 = "Race/Ethnicity"
        dmdeduc = "Education Level"
        indfminc = "Family income"
        bmxwt = "Weight (kg)"
        bmxbmi = "Body mass index"
        bmxtri = "Triceps skinfold (mm)"
        bmxwaist = "Waist circumference (cm)"
        bmxthicr = "Thigh circumference (cm)"
        bmxarml = "Upper arm length (cm)";
run;
proc surveyfreq data=nhanes;
  stratum sdmvstra;
  cluster sdmvpsu;

  weight wtmec2yr;
  tables bmiclass*ageclass/chisq deff;
run;

Table of bmiclass by ageclass

                              Weighted   Std Dev of              Std Err of    Design
bmiclass  ageclass  Frequency Frequency    Wgt Freq    Percent      Percent    Effect
-------------------------------------------------------------------------------------
1         1              881    9566761      716532     6.0302       0.3131    0.8572
1         2               75    2686500      495324     1.6934       0.2434    1.7639
1         Total          956   12253261     1083245     7.7236       0.3951    1.0855
2         1              788   19494074     1615408    12.2878       0.6689    2.0573
2         2             1324   57089320     4300350    35.9853       1.2334    3.2727
2         Total         2112   76583394     5452762    48.2731       1.2969    3.3381
3         1              627   15269676     1538135     9.6250       0.6261    2.2338
3         2             1262   54539886     4631512    34.3783       1.2349    3.3499
3         Total         1889   69809562     5816243    44.0033       1.4066    3.9794
Total     1             2296   44330511     3336974    27.9430       1.0011    2.4666
Total     2             2661  114315706     8538003    72.0570       1.0011    2.4666
Total     Total         4957  158646217    11335366    100.000
-------------------------------------------------------------------------------------
Frequency Missing = 4686

Rao-Scott Chi-Square Test

Pearson Chi-Square      525.1560
Design Correction         1.6164

Rao-Scott Chi-Square    324.8848
DF                             2
Pr > ChiSq                <.0001

F Value                 162.4424
Num DF                         2
Den DF                        30
Pr > F                    <.0001

  if violent > 0 then isviol = 1;
  else if violent = 0 then isviol = 0;
run;
proc surveyfreq data=ncvs;
  stratum pstrat;
  cluster ppsu;
  weight pweight;
  tables isviol*sex/chisq;
run;

Table of isviol by sex

                           Weighted   Std Dev of              Std Err of
isviol  sex     Frequency  Frequency    Wgt Freq    Percent      Percent
------------------------------------------------------------------------
0       0           36120  107752330     1969996    47.6830       0.1620
0       1           42161  115149882     1906207    50.9566       0.1638
0       Total       78281  222902212     3813397    98.6396       0.0663
1       0             558    1746669      101028     0.7729       0.0445
1       1             441    1327436       81463     0.5874       0.0341
1       Total         999    3074105      155677     1.3604       0.0663
Total   0           36678  109498999     1985935    48.4560       0.1597
Total   1           42602  116477318     1930795    51.5440       0.1597
Total   Total       79280  225976317     3853581    100.000
------------------------------------------------------------------------
Frequency Missing = 80

Rao-Scott Chi-Square Test

Pearson Chi-Square       30.6160
Design Correction         1.0466

Rao-Scott Chi-Square     29.2529
DF                             1
Pr > ChiSq                <.0001

F Value                  29.2529
Num DF                         1
Den DF                       143
Pr > F                    <.0001
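The relationship among the statistics reported by PROC SURVEYFREQ can be verified directly: the Rao-Scott statistic is the Pearson statistic divided by the design correction, and the F value divides the Rao-Scott statistic by its df. A quick check in Python using the values from the two output listings above:

```python
# Recover the Rao-Scott chi-square and F statistics from the Pearson
# chi-square and design correction printed by PROC SURVEYFREQ.
def rao_scott(pearson, correction, df):
    rs = pearson / correction
    return rs, rs / df

rs1, f1 = rao_scott(525.1560, 1.6164, 2)   # NHANES table: about 324.88, 162.44
rs2, f2 = rao_scott(30.6160, 1.0466, 1)    # NCVS table: about 29.25, 29.25
```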