English Pages 480 Year 1970
SAMPLING THEORY OF SURVEYS WITH APPLICATIONS
P. V. SUKHATME & B. V. SUKHATME
Other Books
Feeding India's Growing Millions
BY P. V. SUKHATME
Statistical Methods for Agricultural Workers
BY V. G. PANSE AND P. V. SUKHATME
SAMPLING THEORY OF SURVEYS WITH APPLICATIONS
PANDURANG V. SUKHATME, Ph.D., D.Sc.
Director, Statistics Division, Food and Agriculture Organization of the United Nations
Formerly Statistical Adviser, Indian Council of Agricultural Research, New Delhi
AND
BALKRISHNA V. SUKHATME, Ph.D.
Professor of Statistics, Iowa State University, Ames, Iowa, and Indian Council of Agricultural Research, New Delhi
IOWA STATE UNIVERSITY PRESS
AMES, IOWA, U.S.A.
© 1954 by the Indian Society of Agricultural Statistics, New Delhi, India, and the Iowa State University Press, Ames, Iowa, U.S.A.
© 1970 by Pandurang V. Sukhatme and Balkrishna V. Sukhatme.
All rights reserved. Printed in the U.S.A.
First Edition, 1954
Second Printing, 1958
Third Printing, 1960
Second, Revised Edition, 1970
Spanish Translation: Teoria de Encuestas por Muestreo con Aplicaciones
International Standard Book Number: 0-8138-1370-0
FOREWORD TO THE SECOND EDITION
The first edition of the book was issued 15 years ago when I was Director of the Economics Division of FAO and the author, Dr. P. V. Sukhatme, headed the Statistics Branch in my Division. I recall going with Dr. Sukhatme to the late Norris E. Dodd, then Director-General of FAO, and requesting him to write the foreword that appeared in the first edition. This memory has inevitably come to my mind now that Dr. Sukhatme has asked me, within so short a period of my assuming charge as Director-General of FAO, to write a foreword to the second edition of this book. I do so with great pleasure.

Among the major tasks of the FAO is the promotion of the improvement of national systems for collecting agricultural statistics. The programme of work of FAO, as approved by the Conference in 1952, specifically recommended the promotion of the use of sampling as a major means of achieving this task. This programme, among other measures, included the preparation of a series of publications on the theory and practice of sampling surveys. Dr. Sukhatme's book was the first in this series.

This is not the place to elaborate on the work done by FAO in the field of sampling. Suffice it to say that there is hardly any country in the world today which does not use sampling in one form or another for collecting and reporting national statistics. The widespread use of sampling is best illustrated with reference to decennial censuses of agriculture. Whereas less than a quarter of the developing countries participating in the 1950 World Census of Agriculture made use of sampling in carrying out their agricultural censuses, this proportion increased to three-fifths in the 1960 World Census of Agriculture. For the 1970 World Census of Agriculture, which FAO is currently engaged in promoting, most of the developing countries have indicated that they plan to use sampling in their censuses of agriculture.

FAO has a large number of field experts in different countries of the world whose job is to assist those countries in developing the use of sampling for improving agricultural statistics and in the training of local personnel in the use of sampling methods. Likewise, FAO has a large training programme in the theory and practice of sampling methods for the collection of national statistics. For the 1970 World Census of Agriculture alone, FAO has organized, or is in the process of organizing, some 20 training centres at the national, regional and international levels. Dr. Sukhatme's
book has been extensively used in this programme of work.

The years since the publication of the first edition have seen a considerable advance in the theory and practice of sampling. I have particularly in mind here the vast experience accumulated by FAO in the application of sampling methods under widely different field conditions. The publication of the second edition of the book, which incorporates the latest advances in the use of sampling, which summarizes the experience accumulated by FAO and which also comes at a time when FAO is promoting the 1970 World Census of Agriculture, is therefore most timely, and will, I feel sure, be welcomed by all.

The need for reliable and comprehensive statistics was never more acutely felt than at present, when most countries are engaged in large-scale planning. Statistics, however, take time to develop and require a great deal of effort and money which are not always easy to find. While planning cannot await the full development of statistics, there can be no sound planning without reliable statistics. The merit of sampling lies in providing an economic and efficient method for collection of data and thus accelerating the development of statistics urgently needed by planners. It is my hope that this book will make a significant contribution to the realization of this objective.

A. H. Boerma
Director-General
Food and Agriculture Organization
PREFACE TO THE SECOND EDITION
The first edition of the book in English has been out of print for quite some years now. The Spanish edition continues to be reprinted; the latest reprint was in 1968. We have had some hesitation in bringing out the present edition, mainly because of the large number of books on sampling theory of surveys which have been published since the first edition of the book in 1954. However, we continued to receive numerous enquiries and in fact we were urged to undertake the revision. Accordingly, B. V. Sukhatme, who has been teaching courses in sampling theory at ICAR, New Delhi, and at Iowa State University, undertook to revise the book.

This edition retains essentially the same character as the first except that algebra has been somewhat condensed and exercises have been added at the end of each chapter. Most of the exercises have been taken from published statistical literature. Chapter II has been re-written and several chapters have been expanded. Chapter X has been expanded mainly with a view to meet the comments on the first edition that more examples be given in the book to deal with the methodological problems that one faces in practice. The only way to do justice to this aspect, without unduly increasing the length of the book, would have been to prepare another volume using case studies. Such a volume in fact was prepared as early as 1955 jointly with Dr. Panse, but remained unpublished for one reason or another. The best we could do in this edition was to include more examples in Chapter X as well as to add a new section summarizing recent experiences in agricultural censuses in developing countries. We are aware that this is all too brief; nevertheless we trust that it will give the reader a greater appreciation of the problems met with in practice, as also of the relative magnitude of the different sources of error and the need to control them as best as one can within the resources of men and money available.

We have been happy at the reception given to the first edition of the book and we hope this new edition will be received equally well. We are particularly indebted to Dr. O. P. Aggarwal, formerly on the teaching staff at Iowa State University and now Chief of the Census Branch at FAO, for his hard work in reading through the manuscript critically, in suggesting appropriate changes and in introducing a great deal of clarity and rigour in the revision. We also wish to express our gratitude to Messrs. M. S. Avadhani, L. A. Gould, J. Sedransk, R. Singh, S. S. Zarkovic and R. Zasepa, who read parts of the manuscript and made
many helpful suggestions. As in the case of the first edition, the responsibility of reading the proofs was principally undertaken by Dr. P. N. Saxena, Chief Statistician, IARI, New Delhi, to whom we are deeply grateful for his devotion and effort. We also wish to thank Mrs. Avonelle Jacobson for the care and accuracy with which our manuscript was typed and reproduced in the form of mimeograph notes for use by students.

We are indebted to Mr. A. H. Boerma, Director-General of FAO, and Dr. T. A. Bancroft, Director of the Statistical Laboratory, Iowa State University, for encouraging us to undertake the revision in the conviction that it will assist in the teaching of an advanced course in the theory and practice of sample surveys as well as in promoting sampling for improvement of agricultural statistics in the developing countries, particularly at a time when FAO is so actively engaged in the promotion of the 1970 World Census of Agriculture.

P. V. Sukhatme
B. V. Sukhatme
PREFACE TO THE FIRST EDITION
This book is an outgrowth of lectures on sample surveys which the author has delivered since 1945 at the Indian Council of Agricultural Research, subsequently at the International School on Censuses and Statistics in 1949-50 held at Delhi under the auspices of the Food and Agriculture Organization of the United Nations, at the two summer sessions conducted by the Indian Society of Agricultural Statistics in 1950 and 1951, and finally at the Statistical Laboratory of the Iowa State College, Ames, Iowa, U.S.A., in the spring of 1952.
There was no plan at first of publishing a book and the notes prepared for the lectures were mimeographed for the use of the students, but as the scope of the course was gradually enlarged, suggestions were received that the lectures should be published in the form of a text for teaching at colleges and universities. It was felt that this publication would fulfil a real need for a systematic treatment of the sampling theory in relation to large-scale surveys. About the same time the Conference of the Food and Agriculture Organization of the United Nations recommended at its Sixth Session that a book be prepared incorporating a comprehensive treatment of the sampling theory of surveys and its applications so as to be of direct assistance to the sampling experts working in various countries in their efforts to introduce the sampling method for improvement of agricultural statistics. The mimeographed notes were accordingly reorganized and amplified to include illustrative material on agricultural surveys from different countries; the publication of the present book is the result.

In keeping with its objectives the book is primarily designed to serve the needs of a text for teaching an advanced course in sampling theory of surveys and of a reference book for statisticians entrusted with the planning of surveys for collecting statistics. Every attempt has been made to present all the modern developments of sampling theory which are of importance in survey work. Some of the results have already appeared in the papers published in the Journal of the Indian Society of Agricultural Statistics. These might appear new to many readers since they might not have seen this Journal. The book also gives a number of results which are being published for the first time. Among these should be mentioned particularly the algebraic treatment of non-sampling errors whose importance relative to sampling errors has not been sufficiently stressed in the literature on the subject.
In order that the theory presented in the book should be of direct assistance in practice, it is illustrated with examples of actual surveys so as to serve the special needs of under-developed countries in the field of sampling, as recommended by FAO. These examples are oriented largely around agricultural statistics, in keeping with the author’s experience in this field and FAO’s interest, and relate to surveys for the estimation of crop acreage, yield, incidence of insect pests on crops, livestock numbers and their products, other farm facts and fisheries production. The author is conscious that these examples by themselves will not meet the entire needs of sampling workers, particularly those from the economically less developed countries where the resources available for planning surveys are meagre and a large majority of the people are illiterate, do not appreciate the purpose of the inquiry, nor know the correct answers to the questions put to them. The contribution to the total error in the result arising from this latter factor is very large in these countries and emphasizes the great value of developing satisfactory measurement techniques before attempting nation-wide surveys. The relevant theory bearing on this question has been discussed in Chapter X. What is further needed is a simple exposition of a few typical surveys. Such a book is nearing completion and it is hoped to make it available soon.

The need for keeping the volume within reasonable size has prevented any elaborate supporting description of the theory and examples given in the book. The author’s aim all along has been to present the theory in as straightforward a manner as possible. The only prerequisites are college algebra, elements of calculus and principal statistical methods such as those covered in Statistical Methods for Agricultural Workers, by V. G. Panse and the author. Even so the author is aware that at places the treatment has become too terse.
Such sections have been marked with an asterisk to indicate that the portion can be left over from the first reading without losing the continuity of the text.

The author has received considerable assistance in preparing the book from his former colleagues in India. First of all he gratefully acknowledges the encouragement and help which he received from his former Chief, Mr. P. M. Kharegat, then Secretary to the Ministry of Agriculture, Government of India, to whose farsightedness are principally due the advances which India has made in the field of sampling. He is indebted to Messrs. V. G. Panse, G. R. Seth, K. Kishen, R. D. Narain, O. P. Aggarwal and B. V. Sukhatme who read parts of the manuscript and made numerous suggestions to improve the presentation; to Messrs. K. S. Krishnan, S. H. Ayer and K. V. R. Sastry who worked through the examples; and to Mrs. Evans of the Statistics Branch of FAO who checked through them and also helped in the preparation of the index to the book; to Dr. P. N. Saxena who shouldered a particularly heavy responsibility of reading critically the manuscript and the proofs; and to Suzanne Brunelle and Mary Nakano for their typing and secretarial help. The author also likes to express his thanks to Dr. T. A. Bancroft, Dr. D. J. Thompson and other members of the staff of the Statistical Laboratory, Iowa State College, with whom he had the opportunity to work as visiting professor during the spring term of 1952 and to Marshall Townsend of the Iowa State College Press for their interest and encouragement in the publication of the book. Last but not least the author is indebted to Mr. Norris E. Dodd, Director-General of the FAO, who invited the author to come to FAO to head the Statistics Branch, which gave him the opportunity to appreciate more fully the urgent need for promoting sampling for improving agricultural statistics in under-developed countries; and to Dr. A. H. Boerma, Director of Economics Division of the FAO, for his constant encouragement and advice.

September 1953
PANDURANG V. SUKHATME
CONTENTS
FOREWORD TO THE SECOND EDITION
PREFACE TO THE SECOND EDITION
PREFACE TO THE FIRST EDITION
I. BASIC THEORY: SIMPLE RANDOM SAMPLING
1.1 Concept of Sampling; 1.2 Simple Random Sampling; 1.3 Procedure of Selecting a Simple Random Sample; 1.4 Notation; 1.5 Properties of Estimates: Unbiasedness and Consistency; 1.6 Sampling Variance, Standard Error and Mean Square Error; 1.7 Expected Value of the Sample Mean; 1.8 Expected Value of the Sample Mean Square; 1.9 Sampling Variance of the Mean; 1.10 Expected Value and Sampling Variance of s; 1.11 Confidence Limits; 1.12 Size of Sample for Specified Precision; 1.13 Hyper-Geometric Distribution: Two Classes; 1.14 Mean Value of the Hyper-Geometric Distribution; 1.15 Variance of the Hyper-Geometric Distribution; 1.16 Confidence Limits and Size of Sample for Specified Precision; 1.17 Generalized Hyper-Geometric Distribution; 1.18 Expected Value and Variance of the Proportion in one Class to a Group of Classes; 1.19 Variance of a Function of Two Random Variables; 1.20 Inverse Sampling; 1.21 Quantitative and Qualitative Characteristics; Exercises; References; Appendix I: Table of Random Numbers.
II. SAMPLING WITH VARYING PROBABILITIES
2.1 Introduction; 2.2 Procedure of Selecting a Sample with Varying Probabilities; 2.3 Sampling with Replacement: Sample Estimate and its Variance; 2.4 Sampling with Varying Probabilities and without Replacement; 2.5 Ordered Estimates; 2.6 Unordered Estimates; 2.7 Horvitz-Thompson Estimate; 2.8 Midzuno System of Sampling; 2.9 Narain Method of Sampling; 2.10 Systematic Sampling with Varying Probabilities; 2.11 Other Developments; 2.12 Comparison of Different Sampling Procedures; Exercises; References.
III. STRATIFIED SAMPLING
3.1 Introduction; 3.2 Estimate of the Population Mean and its Variance; 3.3 Choice of Sample Sizes in Different Strata; 3.4 Variance of the Weighted Mean under Different Systems of Allocation; 3.5 Comparison of Stratified Sampling with Simple Random Sampling without Stratification; 3.6 Practical Difficulties in Adopting the Neyman Method of Allocation; 3.7 Estimation of the Gain in Precision due to Stratification; 3.8 Post-Stratification for Improving the Precision of a Simple Random Sample; 3.9 Effect of Increasing the Number of Strata on the Precision of the Estimate; 3.10 Effects of Inaccuracies in Strata Sizes; 3.11 Construction of Strata; 3.12 Deep Stratification; 3.13 Allocation of Sample Size to Strata with Several Characteristics; 3.14 The Method of Collapsed Strata; 3.15 Controlled Selection.
VARYING PROBABILITIES OF SELECTION
3.16 Estimate of the Population Mean and its Sampling Variance; 3.17 Allocation of Sample Among Different Strata; 3.18 Variance of the Estimate under (i) Optimum Allocation, and (ii) Proportional Allocation when the Total Size of Sample is Fixed; 3.19 Efficiency of Stratified Sampling; 3.20 Estimation of the Change in Variance due to Stratification; Exercises; References.
IV. RATIO METHOD OF ESTIMATION
4.1 Introduction; 4.2 Notation and Definition of the Ratio Estimate; 4.3 Expected Value of the Ratio Estimate; 4.4 Second Approximation to the Expected Value of the Ratio Estimate; 4.5 Variance of the Ratio Estimate; 4.6 Estimate of the Variance of the Ratio Estimate; 4.7 Second Approximation to the Variance of the Ratio Estimate; 4.8 An Optimum Property of Ratio Estimate; 4.9 Confidence Limits; 4.10 Efficiency of the Ratio Estimate; 4.11 Ratio Estimate in Stratified Sampling; 4.12 Unbiased Ratio-type Estimate; 4.13 Multivariate Extension of the Ratio Method of Estimation; 4.14 The Case when $\bar{x}_N$ is not Known; 4.15 Ratio Method for Qualitative Characteristics: Two Classes; 4.16 Extension to k Classes.
SAMPLING WITH VARYING PROBABILITIES
4.17 Ratio Estimate and its Variance; Exercises; References; Appendix II: Expected Values of Certain Higher Order Product Moments.
V. REGRESSION METHOD OF ESTIMATION
5.1 Introduction; 5.2 Simple Regression Estimate; 5.3 Expected Value of the Simple Regression Estimate; 5.4 Variance of the Simple Regression Estimate; 5.5 Estimate of the Variance of the Simple Regression Estimate; 5.6 Conditions under which the Simple Regression Estimate is Optimum; 5.7 Estimation of the Variances of Simple and Weighted Regression Estimates under Optimum Conditions; 5.8 Comparison of Weighted Regression Estimate with Simple Regression Estimate; 5.9 Comparison of Simple Regression Estimate with the Ratio Estimate and the Simple Unbiased Estimate; 5.10 Comparison of Simple Regression with Stratified Sampling; 5.11 Double Sampling; 5.12 Regression Estimates in Stratified Sampling; 5.13 Successive Sampling; Exercises; References.
VI. CHOICE OF SAMPLING UNIT
A. EQUAL CLUSTERS
6.1 Cluster Sampling; 6.2 Efficiency of Cluster Sampling; 6.3 Efficiency of Cluster Sampling in Terms of Intra-Class Correlation; 6.4 Estimation from the Sample of the Efficiency of Cluster Sampling; 6.5 Relationship between the Variance of the Mean of a Single Cluster and its Size; 6.6 Optimum Unit of Sampling and Multipurpose Surveys; 6.7 Use of Supplementary Information in Improving the Efficiency of Cluster Sampling.
B. UNEQUAL CLUSTERS
6.8 Estimates of the Mean and their Variances; 6.9 Efficiency of Cluster Sampling when Clusters are of Unequal Size; 6.10 Sampling with Replacement and Unequal Probabilities: Estimate of the Mean and its Variance; 6.11 Probability Proportional to Cluster Size: Efficiency of Cluster Sampling; 6.12 Probability Proportional to Cluster Size: Relative Efficiency of Different Estimates; Exercises; References.
VII. SUB-SAMPLING
7.1 Introduction; 7.2 Two-stage Sampling, Equal First-stage Units: Estimate of the Population Mean and its Variance; 7.3 Two-stage Sampling, Equal First-stage Units: Estimation of the Variance of the Sample Mean; 7.4 Allocation of Sample to the Two Stages: Equal First-stage Units; 7.5 Comparison of Two-stage with One-stage Sampling; 7.6 Effect of Change in Size of First-stage Units on the Variance; 7.7 Three-stage Sampling, Equal First-stage and Second-stage Units: Sample Mean and its Variance; 7.8 Three-stage Sampling, Equal First-stage and Second-stage Units: Estimation from the Sample of the Variance of the Mean; 7.9 Allocation of Sample to the Three Stages; 7.10 Two-stage Sampling, Unequal First-stage Units: Estimate of the Population Mean; 7.11 Two-stage Sampling, Unequal First-stage Units: Expected Values and Variances of the Different Estimates; 7.12 Two-stage Sampling, Unequal First-stage Units: Estimation of the Variances from the Sample; 7.13 Two-stage Sampling, Unequal First-stage Units: Allocation of Sample; 7.14 Three-stage Sampling, Unequal First- and Second-stage Units; 7.15 Stratified Sub-sampling; 7.16 Optimum Allocation in Stratified Sub-sampling; 7.17 Efficiency of Stratification in Sub-sampling; Exercises; References.
VIII. SUB-SAMPLING (continued)
8.1 Introduction; 8.2 Estimate of the Population Mean and its Variance; 8.3 Estimation of the Variance from the Sample; 8.4 Allocation of Sample; 8.5 Determination of Optimum Probabilities; 8.6 Ratio Estimate; 8.7 Allocation of Sample and Determination of Optimum Probabilities: General Case; 8.8 Relative Efficiency of the Two Sub-sampling Designs; 8.9 Sub-sampling without Replacement; 8.10 Estimation of the Variance from the Sample when Sub-sampling is carried out without Replacement; 8.11 Stratification and the Gain Due to it; 8.12 Sub-sampling with Varying Probabilities of Selection at Each Stage; 8.13 Sampling without Replacement at Each Stage; 8.14 Self-weighting Designs; Exercises; References.
IX. SYSTEMATIC SAMPLING
9.1 Introduction; 9.2 Systematic Sampling in Two Dimensions; 9.3 The Sample Mean and its Variance; 9.4 Comparison of Systematic with Random Sampling; 9.5 Comparison of Systematic with Stratified Random Sampling; 9.6 Comparison of Systematic with Simple and Stratified Random Samples for Certain Specified Populations; 9.7 Estimation of the Variance; 9.8 Two-stage Sample: Equal First-stage Units: Systematic Sampling of Second-stage Units; 9.9 Two-stage Sample: Unequal First-stage Units: Systematic Sampling of Second-stage Units; Exercises; References.
X. NON-SAMPLING ERRORS
PART A - ERRORS IN SURVEYS
10.1 Introduction; 10.2 Types of Errors.
PART B - OBSERVATIONAL ERRORS
10.3 Mathematical Model for the Measurement of Observational Errors; 10.4 Sample Mean and its Variance; 10.5 Estimation of the Different Components of Variance; 10.6 The Mean and Variance of a Stratified Sample in which Enumerators are Assigned the Units in their Respective Strata; 10.7 The Mean and Variance of an Unstratified Sample in which Enumerators are Assigned Neighbouring Units; 10.8 Determination of the Optimum Number of Enumerators; 10.9 Some Studies on Enumerator Variability and Respondent Bias; 10.10 Limitations of the Method of Replicated Samples in Surveys.
PART C - INCOMPLETE SAMPLES
10.11 The Problem; 10.12 Effects of Non-response; 10.13 Hansen and Hurwitz Technique; 10.14 Politz and Simmons Technique; 10.15 Further Developments.
PART D - RECENT EXPERIENCES IN SAMPLE CENSUSES OF AGRICULTURE
10.16 Introduction; 10.17 The Frame; 10.18 Practical Difficulties of the “Open Segment” Concept; 10.19 Response Errors; 10.20 Relative Magnitude of Sampling and Non-sampling Errors; 10.21 Census Data and Planning; Exercises; References.
INDEX
CHAPTER I
BASIC THEORY: SIMPLE RANDOM SAMPLING
1.1 Concept of Sampling
The purpose of statistical surveys is to obtain information about populations. By ‘population’ we understand a group of units defined according to the aims of a survey. Thus, the population may consist of all the fields under a specified crop as in area and yield surveys, or all the agricultural holdings above a specified size as in agricultural surveys. Of course, the population may also refer to persons either of the whole population of a country or a particular sector thereof. The information that we seek about the population is normally the total number of units, aggregate values of the various characteristics, averages of these characteristics per unit, proportions of units possessing specified attributes, etc.

In the collection of data there are basically two different approaches. The first is called complete enumeration. It consists of the collection of data on the survey items from each unit of the population. This procedure is used in censuses of population, agriculture, retail stores, industrial establishments, etc. The other approach, which is more general since the first can be considered as its special case, is based on the use of sampling methods and consists of collecting data on survey items from selected units of the population. A sampling method is a scientific and objective procedure of selecting units from the population and provides a sample that is expected to be representative of the population as a whole. It also provides procedures for the estimation of results that would be obtained if a comparable survey were taken on all the units in the population. In other words, a sampling method makes it possible to estimate the population totals, averages or proportions while reducing at the same time the size of survey operations.

A distinctive feature of surveys based on the use of sampling methods, called for brevity sample surveys, is sampling errors. The feature refers to the discrepancies between the sample estimates and the population values that would be obtained from enumerating all the units in the population in the same way in which the sample is enumerated. These
discrepancies are unavoidable because sample estimates are based on data for only a sample of units. The employment of sampling methods, however, enables estimates of the average magnitude of these discrepancies to be made. Sampling methods also provide the means of fixing in advance the details of survey design, such as the size of the sample, in such a way that the average magnitude of the sampling errors does not exceed the amount allowed with a preassigned probability. In other words, sampling methods enable us to control the precision of sample estimates within limits fixed in advance.

As described in this book, sampling methods are based on laws of chance and the application of the theory of probability. There are other methods of sampling referred to under the name of purposive selection or judgment sampling. In these methods, units are selected in the sample according to how typical they are of the population according to the judgment of specialists in the subject matter. The composition of the sample resulting from the application of such a selection procedure is influenced by the personal judgment of those responsible for selection. The procedure is not objective; neither is it based on the principles of the theory of probability. Consequently, it does not provide the possibility of estimating and controlling the magnitude of sampling errors. In this book we shall concern ourselves only with probability sampling.

A simple way of obtaining a probability sample is to draw the units one by one with a known probability of selection assigned to each unit of the population at the first and each subsequent draw. The successive draws may be made with or without replacing the units selected in the preceding draws. The former is called the procedure of sampling with replacement, the latter sampling without replacement.
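The draw-by-draw procedure just described can be sketched in Python. This is only an illustration under stated assumptions, not a procedure taken from the text: the function name `draw_sample`, the unit labels and the probabilities are hypothetical, and for draws without replacement it is assumed that the probabilities of the units still available are rescaled to sum to one at each draw, which is one possible convention among several.

```python
import random

def draw_sample(units, probs, n, replace, rng=None):
    """Draw n units one by one with known selection probabilities.

    Assumption (not from the text): without replacement, the weights of
    the remaining units are renormalized at each draw.
    """
    rng = rng if rng is not None else random.Random()
    available = list(units)
    weights = list(probs)
    sample = []
    for _ in range(n):
        # random.choices normalizes the weights internally.
        idx = rng.choices(range(len(available)), weights=weights, k=1)[0]
        sample.append(available[idx])
        if not replace:
            # The selected unit is no longer available at later draws.
            del available[idx]
            del weights[idx]
    return sample

# Hypothetical population of four fields with unequal probabilities.
fields = ["field-1", "field-2", "field-3", "field-4"]
p = [0.1, 0.2, 0.3, 0.4]
with_repl = draw_sample(fields, p, 3, replace=True)
without_repl = draw_sample(fields, p, 3, replace=False)
```

With `replace=True` the same unit may appear more than once in the sample; with `replace=False` every selected unit is distinct.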
The application of the probability sampling method as considered in this book assumes that the population can be subdivided into a finite number of distinct and identifiable units called sampling units. It is irrelevant for the sampling procedure what the sampling units are. They may be natural units such as individuals in a human population or fields in a crop survey or natural aggregates of such units like families or villages, or they may be artificial units such as a single plant, a row of plants or a plot of specified size, in sampling a field. For sampling purposes it is essential to be able to list all of the sampling units in the population. Such a list is called the frame and provides the basis for the selection and identification of the units in the sample. Examples of a frame are a list of farms, suitable area segments like villages in India and sections in the United States. The village or section forms the sampling unit and provides the means for further selecting a sample of farms, fields and plots. The frame also often contains information about the size and structure of the population. That information is used in sample surveys in a number of ways as will be explained in subsequent chapters.
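As a rough illustration of how a frame supports both selection and identification, the Python sketch below stores a frame as a list of records and draws units from it by serial number. Everything named here is hypothetical: the village names, the farm counts, the `FrameUnit` record and the helper `select_from_frame` are invented for the example, and equal selection probabilities are used only for simplicity.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class FrameUnit:
    serial: int    # serial number identifying the unit in the frame
    name: str      # hypothetical village name
    n_farms: int   # size information carried by the frame

# A hypothetical frame of four villages (the sampling units).
frame = [
    FrameUnit(1, "Village A", 40),
    FrameUnit(2, "Village B", 25),
    FrameUnit(3, "Village C", 60),
    FrameUnit(4, "Village D", 15),
]

def select_from_frame(frame, n, rng=None):
    """Select n distinct units by serial number and look them up in the frame."""
    rng = rng if rng is not None else random.Random()
    by_serial = {unit.serial: unit for unit in frame}
    chosen = rng.sample(sorted(by_serial), n)  # distinct serial numbers
    return [by_serial[s] for s in chosen]

selected_villages = select_from_frame(frame, 2)
total_farms = sum(unit.n_farms for unit in frame)  # size information from the frame
```

The serial numbers do the work of identification: once they are drawn, the frame tells the surveyor which villages to visit, and the recorded farm counts are the kind of size information that a frame may carry.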
1.2 Simple Random Sampling
The simplest of the methods of probability sampling is known as the method of simple random sampling. Although the method is not much used in actual practice, it is useful to start with it in the presentation of the sampling theory of surveys as its understanding is helpful in the study of other and more complex methods.

In the simple random sampling method, usually called the method of random sampling for brevity, an equal probability of selection (equal to the reciprocal of the number of available units) is assigned to each available unit of the population at the first and each subsequent draw. Thus, if the number of units in the population is $N$, the probability of selecting any unit at the first draw is $1/N$, the probability of selecting any unit from among the available units at the second draw is $1/(N-1)$, and so on. The sample obtained following the above method is called a simple random sample.

An important property of simple random sampling is that the probability of a specified unit of the population being selected at any given draw is equal to the probability of its being selected at the first draw. For, let $n$ denote the number of units to form the sample. The probability that the specified unit is selected at the $r$-th draw is clearly the product of (1) the probability of the event that it is not selected in any of the previous $(r-1)$ draws; and (2) the probability of the event that it is selected at the $r$-th draw under the assumption that it is not selected in any of the previous $(r-1)$ draws. The probability that it is not selected at the first draw is, by definition, $(N-1)/N$; that it is not selected at the second draw, given that it was not selected at the first draw, is $(N-2)/(N-1)$, and so on. The probability (1) is, therefore,
TV-1 TV
TV-2
TV-r+1
'TV- 1 "'TV-r + 2 ”
TV-r+1 TV
The probability (2) is clearly 1/(N - r + 1). The probability that the specified unit of the population is selected at the r-th draw, being the product of the two, is therefore 1/N, which is the probability of its being selected at the first draw and is independent of r.

Since the specified unit may be included in the sample at any of the n draws, it also follows that the probability that it is included in the sample is the sum of the probabilities of n mutually exclusive events, namely, that the specified unit is included in the sample at the first draw, second draw, ..., n-th draw. We have seen that each of these
events has the same probability 1/N. Thus, the probability that the specified unit is included in the sample is equal to n/N.

Since the above result is independent of the specified unit, it follows that every one of the units in the population has the same chance of being included in the sample under the procedure of simple random sampling. This is sometimes offered as the definition of simple random sampling, without realizing that there are other procedures of sampling (e.g., systematic sampling, Chapter IX) which also have this property, and yet do not give the same chance of selection to each available unit of the population at the first draw or in subsequent draws.

The method of simple random sampling is nevertheless equivalent to giving an equal chance to each possible cluster of n units of the population to form the sample. Since the number of possible clusters of size n out of a population of size N is $\binom{N}{n}$, simple random sampling is equivalent to the selection of one of these possible clusters, with an equal probability, $1/\binom{N}{n}$, assigned to each cluster. Thus, if the population consists of four farms serially numbered 1, 2, 3 and 4 having 2, 3, 4 and 7 acres under corn respectively, then a random sample of two farms from this population can be obtained by first drawing a farm with equal probability 1/4, assigned to each of the four farms, and then drawing a second farm with equal probability 1/3, assigned to each of the three remaining farms. Alternatively, we may consider the following six possible clusters each of two farms from this population:
Serial Number     Serial Numbers of Units     Values of the Units
of Cluster        in the Cluster              in the Cluster
1                 1, 2                        2, 3
2                 1, 3                        2, 4
3                 1, 4                        2, 7
4                 2, 3                        3, 4
5                 2, 4                        3, 7
6                 3, 4                        4, 7
and with an equal probability 1/6, assigned to each of the six possible clusters, select one of the clusters as a sample for our study. These two procedures of drawing a sample are equivalent (see exercise 1.1 at the end of the chapter). The word ‘random’ refers to the method of selecting a sample rather than to the particular sample selected. The fact that the above two methods for selecting the sample are equivalent does not mean that they
will choose the same sample in a particular case. Two persons, even when following the same method, will hardly, if ever, get the same sample in practice.

Furthermore, any possible sample can be a random sample, however unrepresentative it may appear, so long as it is obtained by following either of the above two procedures of selecting a random sample. Thus, a person may draw a random sample of 13 cards from a well-shuffled pack and still find that all are of the same suit. The sample is obviously unrepresentative of different colours in this case, but nevertheless must be considered to be a random sample, by virtue of the method of random selection employed. The chance of obtaining such an unrepresentative sample is, however, extremely small.

1.3 Procedure of Selecting a Simple Random Sample
A practical procedure of selecting a random sample is to choose units one by one with the help of a table of random numbers, such as those published by Rand (1955) or Tippett (1927). A few pages from the latter are reproduced in Appendix I. A table of random numbers is so constructed that all numbers 0, 1, 2, ..., 9 appear, independently of one another, with approximately the same frequency. By combining numbers in pairs we have the numbers 00 to 99, and obviously they also appear with approximately the same frequency. Similarly, we may use the numbers three at a time to get random numbers from 000 to 999, four at a time to get random numbers from 0000 to 9999, and so on.

The procedure of selection of a random sample takes the form of (a) identifying the N units in the population with the numbers 1 to N, or, what is the same thing, preparing a list of units in the population in any order and serially numbering them; (b) reading numbers from the table of random numbers by starting at any arbitrary place; and (c) taking for the sample the n units whose serial numbers correspond to those drawn from the table of random numbers. The following examples will illustrate the procedure.

Example 1.1
Select a sample of 34 villages from a list of 338 villages.

Using the three-figure numbers given in columns 1 to 3, 4 to 6, etc., of the table given in the Appendix and rejecting numbers greater than 338 (and also the number 000), we have for the sample:

35, 251, 165, 131, 198, 125, 326, 12, 237, 51, 52, 331, 218, 337, 263, 33, 161, 209, 40, 99, 102, 42, 223, 241, 277, 14, 303, 81, 173, 137, 321, 335, 155, 163.
The procedure involves the rejection of a large number of random numbers, nearly two-thirds. A device commonly employed to avoid the rejection of such large numbers is to divide a random number by 338 and to choose the serial number from 1 through 337 corresponding to the remainder when it is not zero, and the serial number 338 when the remainder is zero. However, it is necessary to reject random numbers 677 to 999 (besides 000) in adopting this procedure, as otherwise villages with serial numbers 1 to 323 will get a larger chance of selection, equal to 3/999, while those with serial numbers 324 to 338 will get a chance equal to 2/999. If we use this procedure and also the same three-figure random numbers as given in columns 1 to 3, 4 to 6, etc., we obtain the sample of villages with serial numbers given below:

125, 206, 326, 193, 12, 237, 35, 251,
325, 338, 114, 231, 78, 112, 126, 330,
312, 165, 131, 198, 33, 161, 209, 51,
52, 331, 218, 337, 238, 323, 263, 90,
11, 223.
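The two devices of Example 1.1 can be sketched in Python as follows. This is only an illustration under stated assumptions: a pseudo-random generator stands in for the printed table of random numbers, and the function names `srs_by_rejection` and `srs_by_remainder` are hypothetical, not taken from the text.

```python
import random

def srs_by_rejection(N, n, rng):
    """Draw n distinct serial numbers from 1..N, rejecting unusable numbers."""
    digits = len(str(N))
    sample = []
    while len(sample) < n:
        r = rng.randrange(10 ** digits)        # e.g. 000..999 when N = 338
        if 1 <= r <= N and r not in sample:    # reject 000, numbers > N, repeats
            sample.append(r)
    return sample

def srs_by_remainder(N, n, rng):
    """Same draw, using the remainder on division by N to reject fewer numbers."""
    digits = len(str(N))
    limit = ((10 ** digits - 1) // N) * N      # 676 when N = 338
    sample = []
    while len(sample) < n:
        r = rng.randrange(10 ** digits)
        if r == 0 or r > limit:                # reject 000 and 677..999
            continue
        serial = r % N or N                    # remainder zero stands for serial N
        if serial not in sample:
            sample.append(serial)
    return sample

rng = random.Random(2024)                      # arbitrary seed, for repeatability
by_rejection = srs_by_rejection(338, 34, rng)
by_remainder = srs_by_remainder(338, 34, rng)
```

Rejecting everything above `limit` is what keeps the chances equal: each serial number then corresponds to exactly the same count of usable random numbers.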
Example 1.2
Nine villages in a certain administrative area contain 793, 170, 970, 657, 1721, 1603, 864, 383 and 826 fields, respectively. Make a random selection of 6 fields, using the method of random sampling.

The total number of fields in all the 9 villages is 7987. The first step in the selection of a random sample of fields is to assume these serially numbered from 1 to 7987, by taking successive cumulative totals:

793, 963, 1933, 2590, 4311, 5914, 6778, 7161, 7987,

the 793 fields in village 1 being assumed to have the serial numbers 1 through 793, the 170 fields in village 2 being assumed to have the serial numbers 794 through 963, and so on.

A reference to the four-digit random numbers obtained by reading together columns 9 through 12 in the Appendix will then give the following sample of fields with serial numbers 7358, 922, 4112, 3596, 633 and 3999. The corresponding fields will be No. 197 from village 9, No. 129 from village 2, No. 1522 and No. 1006 from village 5, No. 633 from village 1, and another, No. 1409, from village 5.

It is obviously not necessary to have all the 7987 fields in the 9 villages numbered serially. Only the fields in villages 1, 2, 5 and 9, which happen to contain the random sample of 6 fields, have to be serially numbered in this example. Care should be taken, however, to ensure that the manner of serially numbering the fields is not influenced by the knowledge of the particular random numbers selected for these villages.
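The translation from an overall serial number to a village and a field within it can be sketched as below, using the figures of Example 1.2. The helper name `locate` is an illustrative assumption; the lookup by bisection on the cumulative totals is one convenient way of doing it, not a procedure prescribed by the text.

```python
from bisect import bisect_left
from itertools import accumulate

fields_per_village = [793, 170, 970, 657, 1721, 1603, 864, 383, 826]
cumulative = list(accumulate(fields_per_village))   # 793, 963, ..., 7987

def locate(serial):
    """Map an overall serial number (1..7987) to (village, field in village)."""
    v = bisect_left(cumulative, serial)             # 0-based village index
    start = cumulative[v - 1] if v > 0 else 0       # fields in earlier villages
    return v + 1, serial - start

drawn = [7358, 922, 4112, 3596, 633, 3999]          # random numbers from the example
located = [locate(s) for s in drawn]
```

With the six random numbers of the example, `located` reproduces the village and field numbers quoted above.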
It will be noted that this selection amounts to a selection of the sample in two stages, selecting a village in the first stage with probability proportional to the number of fields in the village while sampling with replacement, and in the second stage choosing a field by simple random sampling in each selected village on the basis of the random numbers already selected. In this example, the villages selected are 9, 2, 5, 5, 1 and 5 in that order. Note that because of replacement, village 5 has been selected three times.

It is obvious that if we were to select a sample in two stages, selecting a number of first-stage units with probability proportional to the number of second-stage units (in each first-stage unit) while sampling with replacement, and then selecting one second-stage unit from each of the selected first-stage units each time it is selected, we could make the selection in one step by taking successive cumulative totals of the second-stage units and selecting the required number of second-stage units by simple random sampling. The two procedures are equivalent, but it must be emphasized that this equivalence between the one- and two-stage sampling holds good only when the number of second-stage units to be selected from each first-stage unit of sampling is limited to the number of times the first-stage unit is selected (sampling with replacement allows its selection more than once).

1.4 Notation
In the rest of this chapter we shall present the theory of simple random sampling. We shall assume, unless otherwise mentioned, that the sampling units are drawn without replacement and that the characteristic observed on any unit of the population has a unique value. Denote by

$N$ the number of sampling units in the population,

$y$ the characteristic under consideration,

$y_i$ the value of the characteristic for the $i$-th unit of the population,

$\bar{y}_N$ the population mean of the characteristic per unit,

$$\bar{y}_N = \frac{1}{N} \sum_{i=1}^{N} y_i \tag{1}$$

$S^2$ the mean square for the population,

$$S^2 = \frac{1}{N-1} \sum_{i=1}^{N} (y_i - \bar{y}_N)^2 \tag{2}$$

$\sigma^2$ the variance of a single observation in the population,

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (y_i - \bar{y}_N)^2 \tag{3}$$

$n$ the sample size, i.e., the number of units in the sample,

$\bar{y}_n$ the sample mean,

$$\bar{y}_n = \frac{1}{n} \sum_{i}^{n} y_i \tag{4}$$

and $s^2$ the sample mean square,

$$s^2 = \frac{1}{n-1} \sum_{i}^{n} (y_i - \bar{y}_n)^2 \tag{5}$$

where the summation is taken over all the units in the sample. Clearly, $y_i$ when it refers to the sample does not necessarily connote the $i$-th unit in the population: it may be equal to $y_1, y_2, \ldots,$ or $y_N$ depending upon the particular sample under consideration.

1.5 Properties of Estimates: Unbiasedness and Consistency
An estimate will vary from sample to sample, depending upon the units included in the sample. Thus, for the population mentioned in Section 1.2, the sample means will be seen to vary from 2.5 to 5.5 acres per farm. The sample mean square will similarly be found to vary from 0.5 to 12.5, as shown in Table 1.1. However, it will be seen that the averages of the sample means and the sample mean squares over the totality of all the samples are equal to the corresponding population values, which may also be referred to as parameters since we have assumed that the characteristic observed on any unit of the population is uniquely determined. Such sample estimates are called unbiased estimates of the corresponding parameters. Algebraically, this is expressed as:

$$E(\bar{y}_n) = \bar{y}_N \tag{6}$$

$$E(s^2) = S^2 \tag{7}$$

where the symbol $E$ stands, as usual, for expectation. In general, we shall write Est. $\theta = g$, indicating that $g$ is an estimate of the parameter $\theta$ (biased or unbiased). In the particular case under consideration

$$\text{Est. } \bar{y}_N = \bar{y}_n \tag{8}$$

and

$$\text{Est. } S^2 = s^2 \tag{9}$$
and both are unbiased estimates. Sometimes, when it is more convenient, we shall also use the circumflex notation to denote the estimate, as $\hat{\bar{y}}_N$ and $\hat{S}^2$. It will be shown in the following sections that, when a sample is selected by the method of simple random sampling, an unbiased estimate of the population mean is given by the sample mean and an unbiased estimate of the population mean square by the sample mean square.

Table 1.1
VALUES OF THE MEAN AND THE MEAN SQUARE IN DIFFERENT SAMPLES OF TWO FROM THE POPULATION MENTIONED IN SECTION 1.2

Serial No.    Values of Units    $\bar{y}_n$    $s^2$    $\bar{y}_n - \bar{y}_N$    $(\bar{y}_n - \bar{y}_N)^2$    $s^2 - S^2$    $(s^2 - S^2)^2$
of Sample     in the Sample
1             2, 3               2.5            0.5      -1.5                       2.25                           -25/6          625/36
2             2, 4               3.0            2.0      -1.0                       1.00                           -16/6          256/36
3             2, 7               4.5            12.5     0.5                        0.25                           47/6           2209/36
4             3, 4               3.5            0.5      -0.5                       0.25                           -25/6          625/36
5             3, 7               5.0            8.0      1.0                        1.00                           20/6           400/36
6             4, 7               5.5            4.5      1.5                        2.25                           -1/6           1/36
Total                            24.0           28.0     0                          7.00                           0              4116/36
Mean                             4.0            4 2/3    0                          7/6                            0              19.05
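The entries of Table 1.1 can be reproduced by listing every possible sample, as in the following sketch. It uses exact fractions so that the averages can be compared with the population values without rounding; the helper names `mean` and `mean_square` are illustrative assumptions.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 3, 4, 7]            # acres under corn on the four farms
N, n = len(population), 2

def mean(values):
    return Fraction(sum(values), len(values))

def mean_square(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

samples = list(combinations(population, n))        # the six equally likely samples
sample_means = [mean(s) for s in samples]
sample_mean_squares = [mean_square(s) for s in samples]

# Averages over all possible samples, i.e. the expected values
avg_of_means = sum(sample_means) / len(samples)
avg_of_mean_squares = sum(sample_mean_squares) / len(samples)
```

Here `avg_of_means` equals the population mean and `avg_of_mean_squares` equals the population mean square, which is the sense in which both estimates are unbiased.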
We shall now define consistency. An estimate $\hat{\theta}(y_1, y_2, \ldots, y_n)$ is said to be a consistent estimate of the parameter $\theta$ if the probability that the difference between the estimate and the population value exceeds any given amount tends to zero as $n$, the sample size, tends to infinity. In other words, the estimate assumes the population value with probability approaching unity as the sample size increases indefinitely. A consistent estimate need not necessarily be unbiased. If, however, it is biased, the bias will tend to zero in the limit as the sample size tends to infinity.

This definition of consistency strictly applies to estimates based on samples drawn from infinite populations or those drawn with replacement, and needs to be suitably modified in the case of finite populations. A detailed discussion is beyond the scope of this book. We shall adopt the following definition in the case of finite populations. An estimate $\hat{\theta}(y_1, y_2, \ldots, y_n)$ is said to be a consistent estimate of $\theta$ if it becomes the population value when $n = N$ (Fisher, 1922). It will be observed that the sample mean $\bar{y}_n$ and the sample mean square $s^2$ become the population mean $\bar{y}_N$ and the population mean square $S^2$, respectively, when $n = N$. Hence, the sample mean $\bar{y}_n$ and the sample mean square $s^2$ are consistent estimates of the population mean $\bar{y}_N$ and the population mean square $S^2$, respectively. In later chapters we shall come across several estimates which are not unbiased but are consistent.

1.6 Sampling Variance, Standard Error and Mean Square Error
The actual sampling error in an estimate cannot be known without knowing the exact value of the parameter, but a sampling method provides a measure of the average magnitude of the sampling error in the estimate over all possible samples. A simple average of the actual sampling errors over all possible samples is obviously zero in the case of unbiased estimates, as seen from Table 1.1. An average of the sampling error without regard to sign provides one measure, called the mean deviation, but this is not in common use. The average magnitude of the squares of sampling errors over all possible samples is called the sampling variance of the estimate, and its square root is the measure most commonly used for defining the average sampling error. This measure of the average sampling error is called the standard error.

This definition is strictly applicable in the case of unbiased estimates. More generally, for a biased estimate of the population value, the sampling variance is defined as the arithmetic mean of the squares of the differences between the sample estimate and the expected value of the estimate over all samples, and its square root is called the standard error. The arithmetic mean of the squares of the differences between the sample estimate and the population value, in this case, is called the mean square error.

To see the connection between the variance and mean square error of an estimate, suppose that $\hat{\theta}$ is an estimate of $\theta$. Define $B = E(\hat{\theta}) - \theta$ as the bias in the estimate $\hat{\theta}$. In the case of an unbiased estimate, $B = 0$. The sampling variance of this estimate is $E[\hat{\theta} - E(\hat{\theta})]^2$, and the mean square error is $E(\hat{\theta} - \theta)^2$. Clearly

$$\begin{aligned} E(\hat{\theta} - \theta)^2 &= E[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta]^2 \\ &= E[\hat{\theta} - E(\hat{\theta})]^2 + B^2 + 2B\,E[\hat{\theta} - E(\hat{\theta})] \\ &= E[\hat{\theta} - E(\hat{\theta})]^2 + B^2 \end{aligned}$$

Thus we see that the mean square error of an estimate is the sum of its variance and the square of its bias.

1.7 Expected Value of the Sample Mean
We write

$$E(\bar{y}_n) = E\left\{\frac{1}{n} \sum_{i}^{n} y_i\right\} \tag{10}$$

where $y_i$ stands for the value of the $i$-th unit of the population, and the summation is taken over all the $n$ units in the sample. We may number the units in the sample serially, as $1, 2, \ldots, r, \ldots, n$, in the order in which they were drawn, thus writing (10) as

$$E(\bar{y}_n) = \frac{1}{n} E\left\{\sum_{r=1}^{n} y'_r\right\} \tag{11}$$

where $y'_r$ now stands for the value of the unit included in the sample at the $r$-th draw. By a well-known theorem in probability, the expected value of a sum is the sum of the expected values. We, therefore, write

$$E(\bar{y}_n) = \frac{1}{n}\left\{E(y'_1) + E(y'_2) + \ldots + E(y'_r) + \ldots + E(y'_n)\right\} \tag{12}$$

Now, by definition,

$$E(y'_r) = \sum_{i=1}^{N} P_{ir}\, y_i$$

where $P_{ir}$ denotes the probability of drawing a specified unit $y_i$ at the $r$-th draw. We have seen in Section 1.2 that, in simple random sampling, this probability is equal to $1/N$. It follows, therefore, that

$$E(y'_r) = \frac{1}{N} \sum_{i=1}^{N} y_i = \bar{y}_N \qquad (r = 1, 2, \ldots, n)$$

Hence, substituting in (12), we obtain

$$E(\bar{y}_n) = \bar{y}_N \tag{14}$$

The sample mean may also be written as

$$\bar{y}_n = \frac{1}{n} \sum_{i=1}^{N} \alpha_i\, y_i \tag{15}$$

where $\alpha_i = 1$ if $y_i$ is in the sample, $= 0$ otherwise. From (15), taking the expected value of both sides, we write
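... $y_i$ and $y_j$, we have, by definition,

$$E(y'_r\, y'_s) = \sum_{i \ne j = 1}^{N} P_{ir} \cdot P_{js|i} \cdot y_i\, y_j \tag{21}$$

where $P_{js|i}$ denotes the probability of drawing $y_j$ at the $s$-th draw, given that $y_i$ is drawn at the $r$-th draw. Now, from Section 1.2, we have

$$P_{ir} = \frac{1}{N} \tag{22}$$

and, by an extension of the same result,

$$P_{js|i} = \frac{1}{N-1} \tag{23}$$

Hence, substituting for $P_{ir}$ and $P_{js|i}$ from (22) and (23) in (21), we get

$$E(y'_r\, y'_s) = \frac{1}{N(N-1)} \sum_{i \ne j = 1}^{N} y_i\, y_j \tag{24}$$

It follows that

$$E\left\{\sum_{i \ne j}^{n} y_i\, y_j\right\} = E\left\{\sum_{r \ne s = 1}^{n} y'_r\, y'_s\right\} = \frac{n(n-1)}{N(N-1)} \sum_{i \ne j = 1}^{N} y_i\, y_j \tag{25}$$

The result can be alternatively established as follows: We have

$$E\left\{\sum_{i \ne j}^{n} y_i\, y_j\right\} = E\left\{\sum_{i \ne j = 1}^{N} \alpha_i \alpha_j\, y_i\, y_j\right\} = \sum_{i \ne j = 1}^{N} E(\alpha_i \alpha_j)\, y_i\, y_j \tag{26}$$

where the summation $\sum_{i \ne j}^{n}$ extends over the $n(n-1)$ product terms $y_i y_j$ in the sample, and $\alpha_i$, $\alpha_j$ are defined in Section 1.7. Now,

$$E(\alpha_i \alpha_j) = P(\alpha_i \alpha_j = 1) = P(\alpha_i = 1) \cdot P(\alpha_j = 1 \mid \alpha_i = 1) \tag{27}$$

where $P(\alpha_j = 1 \mid \alpha_i = 1)$ denotes the probability of including $y_j$ in the sample, given that $y_i$ is also in the sample. Clearly, we have

$$P(\alpha_j = 1 \mid \alpha_i = 1) = \frac{n-1}{N-1} \tag{28}$$

It follows, therefore, from (27) that

$$E(\alpha_i \alpha_j) = \frac{n(n-1)}{N(N-1)} \tag{29}$$

Hence, on substituting from (29) in (26), we have

$$E\left\{\sum_{i \ne j}^{n} y_i\, y_j\right\} = \frac{n(n-1)}{N(N-1)} \left\{\left(\sum_{i=1}^{N} y_i\right)^2 - \sum_{i=1}^{N} y_i^2\right\} = n(n-1)\left(\bar{y}_N^2 - \frac{S^2}{N}\right) \tag{30}$$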
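Now, using the results in (20) and (30), we have

$$\begin{aligned} E(\bar{y}_n^2) &= \frac{1}{n^2} E\left[\sum_{i}^{n} y_i^2 + \sum_{i \ne j}^{n} y_i\, y_j\right] \\ &= \frac{1}{n^2}\left[n\left\{\bar{y}_N^2 + \left(1 - \frac{1}{N}\right) S^2\right\} + n(n-1)\left\{\bar{y}_N^2 - \frac{S^2}{N}\right\}\right] \\ &= \bar{y}_N^2 + \left(\frac{1}{n} - \frac{1}{N}\right) S^2 \end{aligned} \tag{31}$$

Next, we obtain the expected value of the sample mean square.

$$\begin{aligned} E(s^2) &= E\left\{\frac{1}{n-1}\left(\sum_{i}^{n} y_i^2 - n\bar{y}_n^2\right)\right\} \\ &= \frac{1}{n-1}\left[E\left(\sum_{i}^{n} y_i^2\right) - nE(\bar{y}_n^2)\right] \\ &= \frac{1}{n-1}\left\{n\bar{y}_N^2 + \left(n - \frac{n}{N}\right) S^2 - n\bar{y}_N^2 - \left(1 - \frac{n}{N}\right) S^2\right\} \\ &= S^2 \end{aligned} \tag{32}$$

showing that $s^2$ is an unbiased estimate of $S^2$.

1.9 Sampling Variance of the Mean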
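Let $V(\bar{y}_n)$ denote the sampling variance of the mean. Then, by definition, we have

$$V(\bar{y}_n) = E[\{\bar{y}_n - E(\bar{y}_n)\}^2] = E(\bar{y}_n^2) - \{E(\bar{y}_n)\}^2 \tag{33}$$

Substituting from (14) and (31), we obtain

$$V(\bar{y}_n) = \left(\frac{1}{n} - \frac{1}{N}\right) S^2 \tag{34}$$

which can also be written as

$$V(\bar{y}_n) = \frac{N-n}{N} \cdot \frac{S^2}{n} \tag{35}$$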
The reader may verify that the value of the sampling variance derived from this formula is the value actually obtained in Table 1.1 for a sample of size 2.
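The verification suggested here can be sketched as follows for the four-farm population of Section 1.2. The variable names are illustrative; exact fractions are used so that the formula value and the value obtained by enumerating all samples can be compared directly.

```python
from fractions import Fraction
from itertools import combinations

population = [2, 3, 4, 7]
N, n = len(population), 2

pop_mean = Fraction(sum(population), N)
S2 = sum((y - pop_mean) ** 2 for y in population) / (N - 1)   # population mean square

# Sampling variance of the mean from the formula (N - n)/N * S^2/n
formula_variance = Fraction(N - n, N) * S2 / n

# Sampling variance by direct enumeration of all equally likely samples
means = [Fraction(sum(s), n) for s in combinations(population, n)]
enumerated_variance = sum((m - pop_mean) ** 2 for m in means) / len(means)
```

Both quantities come to 7/6, the mean of the squared deviations shown in Table 1.1.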
The factor $(N - n)/N$ in (35) is a correction for the finite size of the population and is called the finite population correction factor or simply the finite multiplier. When $n$ is small as compared to $N$, the multiplier will not differ much from unity and the sampling variance of the mean will approximate to that of the mean of a sample drawn from an infinite population.

Usually, the value of $S^2$ will not be known. Its estimate from the sample will, therefore, be used in estimating the sampling variance. Thus

$$\text{Est. } V(\bar{y}_n) = \frac{N-n}{N} \cdot \frac{s^2}{n} \tag{36}$$

and the estimate of the standard error is given by

$$\text{Est. S.E.}(\bar{y}_n) = \sqrt{\frac{N-n}{Nn}}\; s \tag{37}$$

However, the estimate in (37), as will be clear from Section 1.10, has a slight negative bias.

1.10 Expected Value and Sampling Variance of s
We have seen in Section 1.8 that the sample mean square $s^2$ provides an unbiased estimate of $S^2$. To obtain the expected value and variance of $s$, we proceed as follows. Let

$$s^2 = S^2 + \varepsilon \tag{38}$$

where

$$E(\varepsilon) = 0 \quad \text{and} \quad E(\varepsilon^2) = V(s^2) \tag{39}$$

We may therefore write

$$s = (S^2 + \varepsilon)^{1/2} = S\left(1 + \frac{\varepsilon}{S^2}\right)^{1/2} \tag{40}$$

Since $\varepsilon$ will be small as compared to $S^2$ with a probability approaching 1 as $n$ becomes large, we may expand the right-hand side as a series. We then have

$$s = S\left\{1 + \frac{\varepsilon}{2S^2} - \frac{\varepsilon^2}{8S^4} + \ldots\right\} \tag{41}$$

Neglecting powers of $\varepsilon$ higher than the second and taking expectation on both sides, we obtain

$$E(s) \cong S\left[1 - \frac{V(s^2)}{8S^4}\right] \tag{42}$$

To obtain the variance of $s^2$, we have by definition

$$V(s^2) = E\{(s^2)^2\} - [E(s^2)]^2 = \frac{1}{(n-1)^2} E\left\{\left(\sum_{i}^{n} y_i^2 - n\bar{y}_n^2\right)^2\right\} - S^4 \tag{43}$$
Expanding and taking expectations as in Section 1.8, an expression for $V(s^2)$ can be derived. The calculation, however, involves much heavier algebra than in the case of the mean. The calculation of the sampling variances of higher order moments is even more laborious. Their derivation is greatly facilitated by the use of monomial symmetric functions (Sukhatme, P. V., 1938) or the use of multivariate symmetric means developed by Robson (1957). The discussion of these methods is, however, beyond the scope of this book. We merely note that the variance of $s^2$ can be derived by using any one of these methods and that the expression for the variance in the limiting case when $N$ tends to infinity is given by

$$V(s^2) = \frac{\mu_4 - \mu_2^2}{n} + \frac{2\mu_2^2}{n(n-1)} \tag{44}$$

where $\mu_2$ and $\mu_4$ denote the second and fourth moments of the population about its mean, $\mu_2$ being equal to $S^2$ in the limit. Using the Pearsonian notation for departure from normality, this can be written as

$$V(s^2) = S^4\left[\frac{\beta_2 - 1}{n} + \frac{2}{n(n-1)}\right] \tag{45}$$

where

$$\beta_2 = \frac{\mu_4}{\mu_2^2} \tag{46}$$

For the normal population, $\beta_2 = 3$, so that

$$V(s^2) = \frac{2S^4}{n-1} \tag{47}$$

and

$$\text{S.E.}(s^2) = \sqrt{\frac{2}{n-1}}\; S^2 \tag{48}$$
Expected values of higher order sample moments and their products have been worked out and tabulated for ready reference by Sukhatme, P. V. (1944).

Before proceeding to obtain the sampling variance of $s$, we first observe from (42) that $s$ will under-estimate $S$, although, if $n$ is large, the bias will be negligible. To obtain the variance of $s$, we have

$$\begin{aligned} V(s) &= E\{s - E(s)\}^2 \\ &= E(s^2) - [E(s)]^2 \\ &\cong S^2 - S^2\left[1 - \frac{V(s^2)}{8S^4}\right]^2 \\ &\cong \frac{V(s^2)}{4S^2} \end{aligned} \tag{49}$$

Substituting from (45), we have

$$V(s) \cong \frac{S^2}{4}\left[\frac{\beta_2 - 1}{n} + \frac{2}{n(n-1)}\right] \tag{50}$$

When the population is normal, $\beta_2 = 3$ and we obtain

$$V(s) \cong \frac{S^2}{2(n-1)} \tag{51}$$

and

$$\text{S.E.}(s) \cong \frac{S}{\sqrt{2(n-1)}} \tag{52}$$

1.11 Confidence Limits
The standard error gives an idea of the frequency with which errors (differences between the sample estimate and the population value) of a given magnitude may be expected to occur if repeated random samples of the same size are drawn from the population. Usually errors smaller than the standard error will occur with a frequency of about 68 per cent, and those smaller than twice the magnitude of the standard error will occur with a frequency of about 95 per cent, provided the estimate is approximately normally distributed. In general, if the sample size is not too small and $N$ is large and if the estimate under consideration is a linear unbiased estimate of the population parameter, then the frequency with which errors will exceed a fixed multiple of the standard error of the estimate is approximately equal to the frequency as determined by the normal law. Erdos and Renyi (1959) and Hajek (1960) give the necessary and sufficient conditions under which the distribution of the sample mean tends to the normal distribution. Consequently, from a knowledge of the standard error of the estimate and with the help of the normal probability integral tables, we are in a position to locate the actual unknown value of the parameter within certain limits with a known relative frequency.

To take the example of estimating the population mean, we know that the mean of a random sample will be approximately normally distributed if the size of the sample is not too small and if the population from which it is drawn is not very different from the normal. We may, therefore, expect that

$$\bar{y}_n - \sqrt{\frac{N-n}{Nn}}\; S \le \bar{y}_N \le \bar{y}_n + \sqrt{\frac{N-n}{Nn}}\; S \tag{53}$$

on an average in 68 out of 100 occasions, and

$$\bar{y}_n - 2\sqrt{\frac{N-n}{Nn}}\; S \le \bar{y}_N \le \bar{y}_n + 2\sqrt{\frac{N-n}{Nn}}\; S \tag{54}$$

on an average with a frequency of about 95 out of 100. In general, we can expect the inequality

$$\bar{y}_n - t_{(\alpha, \infty)} \sqrt{\frac{N-n}{Nn}}\; S \le \bar{y}_N \le \bar{y}_n + t_{(\alpha, \infty)} \sqrt{\frac{N-n}{Nn}}\; S \tag{55}$$
where $t_{(\alpha, \infty)}$ is the value of the normal variate corresponding to the value $1 - \alpha/2$ of the normal probability integral, to hold on an average with a probability $1 - \alpha$. The two limits, on either side of the population mean in (55), are called the confidence limits and the interval between them the confidence interval. The probability with which the inequality holds, viz., $1 - \alpha$, is termed the confidence coefficient.

It should be noted that the confidence limits may vary from sample to sample. Thus the confidence limits for the six different samples mentioned in Section 1.2 at the 68 per cent and 95 per cent confidence coefficients work out as shown in columns 2 and 4 of Table 1.2.

Table 1.2
CONFIDENCE LIMITS FOR DIFFERENT SAMPLES MENTIONED IN TABLE 1.1

              Confidence Limits (1 - α = 0.68)      Confidence Limits (1 - α = 0.95)
Sample No.    Based on S        Based on s          Based on S        Based on s
(1)           (2)               (3)                 (4)               (5)
1             1.4, 3.6          1.8, 3.2            0.3, 4.7          -2.0, 7.0
2             1.9, 4.1          1.7, 4.3            0.8, 5.2          -6.0, 12.0
3             3.4, 5.6          1.2, 7.8            2.3, 6.7          -18.0, 27.0
4             2.4, 4.6          2.8, 4.2            1.3, 5.7          -1.0, 8.0
5             3.9, 6.1          2.4, 7.6            2.8, 7.2          -13.0, 23.0
6             4.4, 6.6          3.5, 7.5            3.3, 7.7          -8.0, 19.0
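The limits in Table 1.2 can be recomputed with the short sketch below. It assumes the multipliers used in the text: 1 and 2 standard errors for the limits based on S, and the t values 1.85 and 12.7 (one degree of freedom) for the limits based on s. The helper name `limits` is hypothetical.

```python
from itertools import combinations
from math import sqrt

population = [2, 3, 4, 7]
N, n = len(population), 2
pop_mean = sum(population) / N
S = sqrt(sum((y - pop_mean) ** 2 for y in population) / (N - 1))
fpc = sqrt((N - n) / (N * n))          # the factor sqrt((N - n)/(N n))

def limits(centre, multiplier, spread):
    """Lower and upper limits: centre -/+ multiplier * fpc * spread, to 1 decimal."""
    half_width = multiplier * fpc * spread
    return round(centre - half_width, 1), round(centre + half_width, 1)

rows = []
for sample in combinations(population, n):
    y_bar = sum(sample) / n
    s = sqrt(sum((y - y_bar) ** 2 for y in sample) / (n - 1))
    rows.append((
        limits(y_bar, 1.0, S),     # column 2: 68 per cent, based on S
        limits(y_bar, 1.85, s),    # column 3: 68 per cent, based on s
        limits(y_bar, 2.0, S),     # column 4: 95 per cent, based on S
        limits(y_bar, 12.7, s),    # column 5: 95 per cent, based on s
    ))

# Number of samples whose limits contain the population mean
covered_col2 = sum(lo <= pop_mean <= hi for (lo, hi), _, _, _ in rows)
covered_col4 = sum(lo <= pop_mean <= hi for _, _, (lo, hi), _ in rows)
```

Counting coverage in this way gives four samples out of six for column 2 and all six for column 4.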
It will be observed that in four out of six cases, the population mean is contained within the confidence limits given in column 2, while in all the six cases it is contained within the limits shown in column 4, as is to be expected. The result is of course fortuitous in view of the small size of the population, but it serves to demonstrate the meaning of the inequalities above.

When $S^2$ is not known, we use its estimate $s^2$ obtained from the sample. The statement in (55) with $S^2$ replaced by its estimate $s^2$ will, however, no longer be exact. To obtain the confidence limits in this case, we make use of the result that $(\bar{y}_n - \bar{y}_N)/\text{S.E.}(\bar{y}_n)$ is approximately distributed as Student's $t$ with $(n - 1)$ degrees of freedom when $n$ is not too small and the original distribution is not far removed from the normal. If we denote by $t_{(\alpha, n-1)}$ the value of $t$ corresponding to the level of significance $\alpha$ for $(n - 1)$ degrees of freedom, it follows that we may expect the inequality

$$\frac{|\bar{y}_n - \bar{y}_N|}{\sqrt{\dfrac{N-n}{Nn}}\; s} \le t_{(\alpha, n-1)} \tag{56}$$

to hold on the average with probability $(1 - \alpha)$. The $(1 - \alpha)$ confidence limits, when the size of the sample is not too small and the population from which it is drawn is not very different from the normal, are, therefore, given approximately as

$$\bar{y}_n - t_{(\alpha, n-1)} \sqrt{\frac{N-n}{Nn}}\; s, \qquad \bar{y}_n + t_{(\alpha, n-1)} \sqrt{\frac{N-n}{Nn}}\; s \tag{57}$$

For the six samples in Section 1.2 and for the same confidence coefficients as given above, these confidence limits based on $s^2$ are given in columns 3 and 5 of Table 1.2. The values of $t_{(.32, 1)}$ and $t_{(.05, 1)}$ have been interpolated from the $t$-table, being 1.85 and 12.7 respectively (Fisher and Yates, 1938).

1.12 Size of Sample for Specified Precision
Almost the first question which a statistician is called upon to answer in planning a sample survey is about the size of the sample required for estimating the population parameter with a specified precision. The precision is usually specified in terms of the margin of error permissible in the estimate and the coefficient of confidence with which one wants to make sure that the estimate is within the permissible margin of error. Thus, if the error permissible in the estimate of the population value of the mean is, say, $\varepsilon \bar{y}_N$ and the degree of assurance desired is $1 - \alpha$, then clearly we need to know the size of the sample so that

$$P\{|\bar{y}_N - \bar{y}_n| \ge \varepsilon \bar{y}_N\} = \alpha \tag{58}$$

Hence, from (55), we have

$$n = \frac{\dfrac{t^2_{(\alpha, \infty)}\, S^2}{\varepsilon^2\, \bar{y}_N^2}}{1 + \dfrac{1}{N} \cdot \dfrac{t^2_{(\alpha, \infty)}\, S^2}{\varepsilon^2\, \bar{y}_N^2}} \tag{59}$$
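As an illustration of formula (59), the following sketch computes the sample size from an assumed coefficient of variation. The function name `sample_size` and the planning figures (coefficient of variation 0.5, permissible error of 5 per cent, normal deviate 1.96, population of 2000 units) are hypothetical and are not taken from the text.

```python
from math import ceil

def sample_size(cv, rel_error, t_value, N):
    """Sample size from (59); cv is S divided by the population mean,
    rel_error is the permissible error as a fraction of the mean."""
    n0 = (t_value * cv / rel_error) ** 2     # size needed for an infinite population
    return n0 / (1 + n0 / N)                 # adjustment for the finite population

# Hypothetical planning figures
n_exact = sample_size(cv=0.5, rel_error=0.05, t_value=1.96, N=2000)
n_required = ceil(n_exact)                   # round up to a whole number of units
```

The quantity `n0` is the size that would be needed if the population were infinite; dividing by `1 + n0/N` applies the finite population correction.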
The determination of the size of sample from (59) presumes the knowledge of $S/\bar{y}_N$, the coefficient of variation for the population. This can only be roughly estimated. Consequently, (59) can give only a rough idea of the size of the sample required for estimating the population mean with a specified precision. We can, however, improve upon the predicted value of $n$ as follows:

Although the size of the sample is determined from (59), the confidence limits after the survey is completed are obtained from (57). In other words, $n$ could be more precisely evaluated from

$$n = \frac{\dfrac{t^2_{(\alpha, n-1)}\, S^2}{\varepsilon^2\, \bar{y}_N^2}}{1 + \dfrac{1}{N} \cdot \dfrac{t^2_{(\alpha, n-1)}\, S^2}{\varepsilon^2\, \bar{y}_N^2}} \tag{60}$$

had $t_{(\alpha, n-1)}$ been known, which it is not, as it itself depends upon $n$. As a result $n$ is underestimated, since $t_{(\alpha, \infty)}$ is less than $t_{(\alpha, n-1)}$. One may correct for this by increasing the value of $n$ in the ratio $t^2_{(\alpha, n'-1)}/t^2_{(\alpha, \infty)}$, where $n'$ is evaluated from (59), but the correction is not likely to be important unless $n$ is small.

The calculation of $n$ from (59) also assumes knowledge of $S$ when the error, $\varepsilon \bar{y}_N$, permissible in the estimate of the population value of the mean is given, although the confidence limits after the completion of the survey are calculated from (57), which makes use of $s$. An allowance for this inaccuracy can be made by making use of the idea, originally due to Neyman (1934), of selecting a preliminary sample for improving the sampling design of the survey (Sukhatme, P. V., 1935). Let $n_1$ be the size of the preliminary sample and $s_1^2$ denote the estimate of $S^2$ obtained therefrom. Then the additional sample required for estimating the population value with the desired accuracy, assuming $N$ to be large and $\varepsilon \bar{y}_N$ to be given, will be $n - n_1$, where

$$n = \frac{t^2_{(\alpha, n_1 - 1)}\, s_1^2}{\varepsilon^2\, \bar{y}_N^2} \tag{61}$$
It has been shown by Stein (1945) that $n$ so estimated by such a two-sample procedure satisfies the statement in (58) and, on the average, gives a more accurate confidence interval than a single-sample procedure when $S$ is unknown, but further discussion of the problem is beyond the scope of this book.
1.13 Hyper-Geometric Distribution: Two Classes
We shall now consider the theory of simple random sampling as applied to qualitative characteristics. Consider, first, a situation in which the sampling units in a population are divided into two mutually exclusive classes, class 1 consisting of units possessing the attribute under consideration, and class 2 consisting of those not possessing it.
Let $p$ and $q$ denote the proportions of sampling units in the population belonging to class 1 and class 2 respectively. Evidently, $Np$ will be the number of sampling units in the population belonging to class 1, $Nq$ the number of sampling units in class 2, and $Np + Nq = N$. Now, clearly, the probability $P(n_1)$ that in a sample of $n$ selected out of $N$ by the method of simple random sampling, $n_1$ will occur in class 1 and $n_2$ in class 2 is given by

$$P(n_1) = \frac{\dbinom{Np}{n_1} \dbinom{Nq}{n_2}}{\dbinom{N}{n}} \tag{62}$$

The variate $n_1$ or the proportion $n_1/n$ is said to be distributed in a hyper-geometric distribution. Further, we have

$$\sum P(n_1) = 1$$

where the summation is taken over all admissible values of $n_1$. As $N$ tends to be large, the distribution (62) approaches the binomial, the probability of observing $n_1$ in class 1 and $n_2$ in class 2 in a sample of $n$ being now given by

$$\binom{n}{n_1}\, p^{n_1} (1 - p)^{n_2}$$

1.14 Mean Value of the Hyper-Geometric Distribution
By definition,

$$\begin{aligned} E(n_1) &= \sum n_1\, P(n_1) \\ &= \sum n_1\, \frac{(Np)!}{n_1!\,(Np - n_1)!} \cdot \frac{(Nq)!}{n_2!\,(Nq - n_2)!} \cdot \frac{n!\,(N - n)!}{N!} \\ &= \frac{Nnp}{N} \sum \frac{(Np - 1)!}{(n_1 - 1)!\,(Np - n_1)!} \cdot \frac{(Nq)!}{n_2!\,(Nq - n_2)!} \cdot \frac{(n - 1)!\,(N - n)!}{(N - 1)!} \\ &= np \sum \frac{\dbinom{Np - 1}{n_1 - 1} \dbinom{Nq}{n_2}}{\dbinom{N - 1}{n - 1}} \end{aligned}$$

Now the expression inside the summation sign represents the probability that in a sample of $n - 1$, $n_1 - 1$ will fall in class 1 and $n_2$ will fall in class 2. Consequently, the sum of all such probabilities over all possible values of $n_1$ is unity. Hence

$$E(n_1) = np \tag{64}$$

Or, denoting by $p_n$ the proportion in the sample, we write

$$E(p_n) = p \tag{65}$$

It follows that an unbiased estimate of the proportion $p$ in the population is given by the proportion in the sample. In other words,

$$\text{Est. } p = \frac{n_1}{n} = p_n \tag{66}$$

Similarly,

$$\text{Est. } q = \frac{n_2}{n} = q_n \tag{67}$$
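The hyper-geometric probabilities and their mean can be checked numerically, as in the sketch below. The population used (20 units of which 8 are in class 1, sample of 5) is a hypothetical example, and the function name `hypergeometric_pmf` is an illustrative assumption.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(N, N1, n):
    """P(n1) for n1 = 0..n, with N1 units in class 1 out of N and a sample of n."""
    total = comb(N, n)
    return {n1: Fraction(comb(N1, n1) * comb(N - N1, n - n1), total)
            for n1 in range(n + 1)}

# Hypothetical population: N = 20 units, 8 of them in class 1, sample of 5
N, N1, n = 20, 8, 5
p = Fraction(N1, N)
pmf = hypergeometric_pmf(N, N1, n)

total_probability = sum(pmf.values())                       # should be unity
expected_n1 = sum(n1 * prob for n1, prob in pmf.items())    # should equal n * p
```

Dividing `expected_n1` by `n` gives the expected value of the sample proportion, which equals `p`.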
1.15 Variance of the Hyper-Geometric Distribution
By definition,

$$V(n_1) = E(n_1^2) - \{E(n_1)\}^2 \tag{68}$$

Now

$$n_1^2 = n_1(n_1 - 1) + n_1 \tag{69}$$

so that

$$V(n_1) = E\{n_1(n_1 - 1)\} + E(n_1) - \{E(n_1)\}^2 \tag{70}$$

Also

$$\begin{aligned} E\{n_1(n_1 - 1)\} &= \sum n_1(n_1 - 1)\, P(n_1) \\ &= \frac{n(n-1)\, Np(Np - 1)}{N(N - 1)} \sum \frac{(Np - 2)!}{(n_1 - 2)!\,(Np - n_1)!} \cdot \frac{(Nq)!}{n_2!\,(Nq - n_2)!} \cdot \frac{(n - 2)!\,(N - n)!}{(N - 2)!} \\ &= \frac{n(n-1)\, Np(Np - 1)}{N(N - 1)} \end{aligned} \tag{71}$$
since the sum of the terms under the summation sign is evidently 1. Substituting the result in (70), we have
BASIC theory: simple random sampling
23
Tr, \ n (n — 1) Np (Np -- 1) 9 . V («x) - —-xrfJ TV-- + rip - n2p2 N(N- 1) N—n
npq
(72)
JV- 1
It follows that the sampling variance of the estimated proportion is given by
(73) and the standard error of the estimated proportion is given by S.E. (pn) —
T-'— ' N— 1 n
(74)
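The results (64) and (72) can be checked by direct enumeration of the distribution (62). The following is a minimal sketch in Python using exact rational arithmetic; the population figures (`N = 20`, 8 units in class 1, `n = 5`) and the function names are illustrative assumptions, not values from the text.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(N, K, n):
    """Exact P(n1) of (62): K = Np units in class 1, sample of n without replacement."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    return {k: Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))
            for k in range(lo, hi + 1)}

def moments(pmf):
    """Mean and variance of a distribution given as {value: probability}."""
    mean = sum(k * pr for k, pr in pmf.items())
    var = sum(k * k * pr for k, pr in pmf.items()) - mean ** 2
    return mean, var

N, K, n = 20, 8, 5                      # hypothetical population and sample size
p = Fraction(K, N)
q = 1 - p
pmf = hypergeom_pmf(N, K, n)
mean, var = moments(pmf)

print("total probability:", sum(pmf.values()))
print("E(n1) =", mean, " np =", n * p)
print("V(n1) =", var, " (N-n)/(N-1) npq =", Fraction(N - n, N - 1) * n * p * q)
```

Because the arithmetic is exact, the two sides of each identity agree as fractions rather than only to rounding error.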
These results can also be obtained directly from those of the preceding sections. All that one need do is to adopt the convention of scoring the characteristic of a sampling unit with one whenever it appears in class 1 and zero when it falls in class 2. On making these substitutions, we obtain

$$\bar{y}_N = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{Np}{N} = p \tag{75}$$

$$\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i = \frac{n_1}{n} = p_n \tag{76}$$

$$S^2 = \frac{\sum_{i=1}^{N} y_i^2 - N\bar{y}_N^2}{N-1} = \frac{Np - Np^2}{N-1} = \frac{N}{N-1}\,p(1-p) \tag{77}$$

$$s^2 = \frac{\sum_{i=1}^{n} y_i^2 - n\bar{y}_n^2}{n-1} = \frac{np_n - np_n^2}{n-1} = \frac{n}{n-1}\,p_n(1-p_n) \tag{78}$$

On substituting the above in (14), we reach the result (65); and on substituting in (35), we get the expression for the variance of the sample proportion, as in (73). Further, from (32), we get

$$E\left\{\frac{n}{n-1}\,p_n(1-p_n)\right\} = \frac{N}{N-1}\,p(1-p) \tag{79}$$

It follows from (79) that an unbiased estimate of the product $p(1-p)$ is given by

$$\text{Est. }\{p(1-p)\} = \frac{n(N-1)}{(n-1)N}\,p_n(1-p_n) \tag{80}$$

and not by $p_n(1-p_n)$ as one might suppose. The unbiased estimate of the sampling variance of a proportion in terms of the sample values is, therefore, given by

$$\text{Est. } V(p_n) = \frac{N-n}{N}\cdot\frac{p_n(1-p_n)}{n-1} \tag{81}$$

and an estimate of the standard error of $p_n$ is given by

$$\text{Est. S.E.}(p_n) = \sqrt{\frac{N-n}{N}\cdot\frac{p_n(1-p_n)}{n-1}} \tag{82}$$

This estimate, however, has a negative bias.

1.16 Confidence Limits and Size of Sample for Specified Precision
The confidence limits for the proportion are derived on the same assumptions as for the quantitative characteristics, namely, that the sample proportion $p_n$ is normally distributed. This will approximately be so, unless $p$ is too small (or too large) and $n$ is small. The limits are given by

$$p = p_n \pm t_{(\alpha,\infty)}\sqrt{\frac{N-n}{N-1}\cdot\frac{p(1-p)}{n}} \tag{83}$$

where $t_{(\alpha,\infty)}$ denotes, as before, the value of $t$ corresponding to the significance level $\alpha$ and $\infty$ degrees of freedom. This can be solved for $n$ when the permissible margin of error $\varepsilon$, expressed as a fraction of $p$, is specified.

For $N$ large and $\varepsilon$ not too small, $n$ is simply given by

$$n = \frac{t^2_{(\alpha,\infty)}\,q}{\varepsilon^2\,p}$$
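In practice $p$ is unknown, so the limits (83) have to be evaluated from the sample. A minimal sketch in Python follows; it assumes that $p_n$ and the estimated standard error (82) are substituted for the unknown quantities in (83), which is a plug-in convention adopted here for illustration and not a step stated in the text. The function name and the sample figures are hypothetical.

```python
import math

def proportion_confidence_limits(n1, n, N, t=1.96):
    """Approximate limits for p from a simple random sample without replacement.

    n1 : sample units falling in class 1
    n  : sample size
    N  : population size
    t  : normal deviate for the chosen level (1.96 for 95 per cent)
    """
    p_n = n1 / n                                        # (66)
    est_var = (N - n) / N * p_n * (1 - p_n) / (n - 1)   # (81)
    se = math.sqrt(est_var)                             # (82)
    return p_n, se, (p_n - t * se, p_n + t * se)

# Hypothetical sample: 120 of 300 sampled units in class 1, out of N = 5000.
p_n, se, (lower, upper) = proportion_confidence_limits(n1=120, n=300, N=5000)
print(f"p_n = {p_n:.3f}, est. S.E. = {se:.4f}, limits = ({lower:.3f}, {upper:.3f})")
```

The normal approximation behind these limits is subject to the same caution as (83): it becomes unreliable when $p$ is close to 0 or 1 and $n$ is small.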
Example 1.3

Material for the construction of 5000 wells was issued during the year 1944 in a certain district as part of the Grow-More-Food Campaign in India. The list of cultivators to whom it was issued, together with the proposed location of each well, is available. A large part of the material was reported to have been misused by diverting it to other purposes. It is proposed to assess the extent of the misuse by means of a sample spot check. In other words, it is proposed to estimate the proportion $p$ of wells actually constructed and used for irrigation purposes. The sample is proposed to be selected by the method of simple random sampling from the total population of wells for which the material was issued. The permissible margin of error in the estimated value is 10 per cent and the degree of assurance desired is 95 per cent. Determine the size of sample for values of $p$ ranging from 0.5 to 0.9.

We are given $N = 5000$, $\varepsilon = 0.10$ and $t_{(\alpha,\infty)} = 1.96$. Substituting in (85), we obtain, for different values of $p$, the following values of $n$.

    p     0.5    0.6    0.7    0.8    0.9
    n     357    244    159     94     42
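The tabulated values can be reproduced numerically. The sketch below is in Python and rests on an assumption: that (85) is the solution of (83) for $n$ when the limits are set at $p \pm \varepsilon p$, namely $n = n_0 / \{1 + (n_0 - 1)/N\}$ with $n_0 = t^2 q / (\varepsilon^2 p)$. This form follows algebraically from (83) and agrees with the table, but the function name and layout are illustrative only.

```python
def sample_size(p, N, eps=0.10, t=1.96):
    """Sample size for estimating p within a relative margin eps (assumed form of (85))."""
    n0 = t ** 2 * (1 - p) / (eps ** 2 * p)   # value for N large
    return n0 / (1 + (n0 - 1) / N)           # allowance for the finite population

sizes = {p: round(sample_size(p, N=5000)) for p in (0.5, 0.6, 0.7, 0.8, 0.9)}
for p, n in sizes.items():
    print(f"p = {p:.1f}  n = {n}")
```

Rounding is to the nearest integer, which is what the table appears to use; rounding up would be the more cautious choice in an actual survey.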
Since the worst critics do not place the misuse at more than half of the material issued, a sample of 357 would appear to be adequate for the proposed check.

1.17 Generalised Hyper-Geometric Distribution
We shall now extend the preceding results to the population which is divided into $k$ mutually exclusive classes. Let $N_i$ denote the number of units in the $i$-th class of the population $(i = 1, 2, \ldots, k)$, so that

$$\sum_{i=1}^{k} N_i = N$$

Now, if a simple random sample of $n$ is drawn out of this population, it can be seen by analogy with the distribution of two classes that the probability that $n_i$ units occur in the $i$-th class and $(n - n_i)$ in all the other $(k - 1)$ classes together is given by

$$P(n_i) = \frac{\binom{N_i}{n_i}\binom{N-N_i}{n-n_i}}{\binom{N}{n}} \tag{86}$$

More generally, it can be seen that the probability that $n_1$ units occur in class 1, $n_2$ in class 2, ..., and $n_k$ in class $k$, is given by

$$P(n_1, n_2, \ldots, n_k) = \frac{\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_k}{n_k}}{\binom{N}{n}} \tag{87}$$

Since (86) is of the same form as (62), it follows from (64) and (72) that

$$E(n_i) = np_i \tag{88}$$

where $p_i = N_i/N$, and

$$V(n_i) = \frac{N-n}{N-1}\,np_i(1-p_i) \tag{89}$$
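The moments (88) and (89) can be verified by enumerating the joint distribution (87) for a small population. The following Python sketch does this with exact fractions; the class sizes `(5, 4, 3)` and the sample size `n = 4` are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb

def class_count_pmf(sizes, n):
    """Joint probability (87) of the class counts (n_1, ..., n_k)."""
    N = sum(sizes)
    pmf = {}
    for counts in product(*(range(min(Ni, n) + 1) for Ni in sizes)):
        if sum(counts) == n:
            numerator = 1
            for Ni, ni in zip(sizes, counts):
                numerator *= comb(Ni, ni)
            pmf[counts] = Fraction(numerator, comb(N, n))
    return pmf

sizes, n = (5, 4, 3), 4          # hypothetical N_1, N_2, N_3 and sample size
N = sum(sizes)
pmf = class_count_pmf(sizes, n)

checks = []
for i, Ni in enumerate(sizes):
    p_i = Fraction(Ni, N)
    mean = sum(c[i] * pr for c, pr in pmf.items())
    var = sum(c[i] ** 2 * pr for c, pr in pmf.items()) - mean ** 2
    checks.append((mean == n * p_i,
                   var == Fraction(N - n, N - 1) * n * p_i * (1 - p_i)))

print("total probability:", sum(pmf.values()))
print("(88) and (89) hold for each class:", checks)
```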
It should be pointed out that the numbers falling in any two classes are not independent of each other, since

$$\sum_{i=1}^{k} n_i = n$$

A measure of dependence is provided by the coefficient of correlation $\rho_{ij}$, defined as the ratio of the covariance between the numbers in the two classes to the product of their standard errors, where

$$\text{Cov}(n_i, n_j) = E(n_i n_j) - E(n_i)\cdot E(n_j) \tag{90}$$

Thus

$$\rho_{ij} = \frac{\text{Cov}(n_i, n_j)}{\sqrt{V(n_i)\cdot V(n_j)}} = \frac{E(n_i n_j) - E(n_i)\cdot E(n_j)}{\sqrt{V(n_i)\cdot V(n_j)}} \tag{91}$$

Now

$$
\begin{aligned}
E(n_i n_j) &= \sum_{n_i,\,n_j} n_i n_j\,P(n_i, n_j, n - n_i - n_j) \\
&= \sum_{n_i,\,n_j} n_i n_j \frac{\binom{N_i}{n_i}\binom{N_j}{n_j}\binom{N-N_i-N_j}{n-n_i-n_j}}{\binom{N}{n}} \\
&= \frac{n(n-1)}{N(N-1)}\,N_i N_j \sum_{n_i,\,n_j} \frac{\binom{N_i-1}{n_i-1}\binom{N_j-1}{n_j-1}\binom{N-N_i-N_j}{n-n_i-n_j}}{\binom{N-2}{n-2}} \\
&= \frac{n(n-1)}{N(N-1)}\,N_i N_j
\end{aligned}
\tag{92}
$$

the summation being taken over all admissible values of $n_i$ and $n_j$. Hence, substituting from (88) and (92) in (90), we have

$$\text{Cov}(n_i, n_j) = \frac{n(n-1)}{N(N-1)}\,N_i N_j - \frac{n^2}{N^2}\,N_i N_j = -\frac{N-n}{N-1}\,np_ip_j \tag{93}$$

Therefore, from (89), (91) and (93), we get

$$\rho_{ij} = -\sqrt{\frac{p_i\,p_j}{(1-p_i)(1-p_j)}} \tag{94}$$

1.18 Expected Value and Variance of the Proportion in One Class to a Group of Classes
The more general problem in the study of qualitative characteristics is the estimation of the proportion of the number in any one class to the number in a combination of some classes containing the class under consideration. Suppose, for instance, the population consists of paddy-growing fields in a district, classified into four classes as shown below, and that a sample of $n$ from all $N$ fields has given the results indicated below the corresponding population numbers.

                          Irrigated                          Unirrigated
                 Manured  Not Manured  Sub-total    Manured  Not Manured  Sub-total    Total
                   (1)        (2)                     (3)        (4)
    Population    N_11       N_12        N_1          N_21       N_22        N_2         N
    Sample        n_11       n_12        n_1          n_21       n_22        n_2         n

The problem for consideration is the derivation of the expected value and the variance of, say, the sample proportion of manured fields among those that are irrigated, i.e., $n_{11}/n_1$. In this case, both the numerator and the denominator are random variables. The expected value is, therefore, best obtained in two stages, by first keeping $n_1$ fixed and thereafter varying it. Thus

$$E\left(\frac{n_{11}}{n_1}\right) = E\left[E\left\{\frac{n_{11}}{n_1}\,\Big|\,n_1\right\}\right] \tag{95}$$

where $E\{n_{11}/n_1 \mid n_1\}$ denotes the conditional expectation of $n_{11}/n_1$ for given $n_1$, and the second $E$ denotes the expected value of the function in $n_1$ so obtained. By definition,

$$E\left\{\frac{n_{11}}{n_1}\,\Big|\,n_1\right\} = \sum P\{n_{11}, n_{12} \mid n_1\}\,\frac{n_{11}}{n_1} \tag{96}$$

Now $P\{n_{11}, n_{12} \mid n_1\}$ denotes the probability that in a sample of $n_1$ (fixed) taken without replacement, $n_{11}$ will fall in class 1 and $n_{12} = n_1 - n_{11}$ in class 2. Thus,

$$P\{n_{11}, n_{12} \mid n_1\} = \frac{\binom{N_{11}}{n_{11}}\binom{N_{12}}{n_{12}}}{\binom{N_1}{n_1}} \tag{97}$$

In other words, $n_{11}/n_1$ follows a hyper-geometric distribution in samples of $n_1$ drawn from $N_1$. Hence, from (96), we obtain

$$E\left\{\frac{n_{11}}{n_1}\,\Big|\,n_1\right\} = \frac{N_{11}}{N_1} \tag{98}$$

It follows that

$$E\left(\frac{n_{11}}{n_1}\right) = \frac{p_{11}}{p_1} \tag{99}$$

where $p_{11} = N_{11}/N$ and $p_1 = N_1/N$.
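The key step above is that the conditional expectation (98) does not depend on $n_1$. A minimal Python sketch, using the conditional distribution (97) with exact fractions, illustrates this; the sub-class sizes `N11 = 6` and `N12 = 9` are hypothetical, and the case $n_1 = 0$ (for which the ratio is undefined) is left out as an assumption of the sketch.

```python
from fractions import Fraction
from math import comb

def conditional_mean_ratio(N11, N12, n1):
    """E(n_11 / n_1 | n_1) under (97): n_1 units drawn without replacement from N_1 = N11 + N12."""
    N1 = N11 + N12
    total = Fraction(0)
    for n11 in range(max(0, n1 - N12), min(n1, N11) + 1):
        prob = Fraction(comb(N11, n11) * comb(N12, n1 - n11), comb(N1, n1))
        total += prob * Fraction(n11, n1)
    return total

N11, N12 = 6, 9                                   # hypothetical sub-class sizes
results = {n1: conditional_mean_ratio(N11, N12, n1) for n1 in range(1, 8)}
print(results)                                    # every entry equals N11 / N1
```

Since every conditional mean equals $N_{11}/N_1$, averaging over $n_1$ as in (95) leaves the value unchanged, which is the content of (99).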
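To obtain the variance of $n_{11}/n_1$, we proceed similarly. By definition,

$$V\left(\frac{n_{11}}{n_1}\right) = E\left[E\left\{\frac{n_{11}^2}{n_1^2}\,\Big|\,n_1\right\}\right] - \frac{N_{11}^2}{N_1^2} \tag{100}$$

We have seen that $n_{11}/n_1$ follows a hyper-geometric distribution in samples of $n_1$ drawn from $N_1$. It follows that

$$E\left\{\frac{n_{11}}{n_1^2}\,\Big|\,n_1\right\} = \frac{N_{11}}{N_1}\cdot\frac{1}{n_1} = \frac{p_{11}}{p_1}\cdot\frac{1}{n_1} \tag{101}$$

and from (71) we have

$$E\left\{\frac{n_{11}(n_{11}-1)}{n_1^2}\,\Big|\,n_1\right\} = \frac{(n_1-1)\,N_{11}(N_{11}-1)}{n_1\,N_1(N_1-1)} \tag{102}$$

Substituting in (100), we have

$$
\begin{aligned}
V\left(\frac{n_{11}}{n_1}\right) &= E\left[\left(1-\frac{1}{n_1}\right)\frac{N_{11}(N_{11}-1)}{N_1(N_1-1)} + \frac{N_{11}}{N_1}\cdot\frac{1}{n_1}\right] - \frac{N_{11}^2}{N_1^2} \\
&= \frac{N_{11}(N_{11}-1)}{N_1(N_1-1)} - \frac{N_{11}^2}{N_1^2} + \frac{N_{11}}{N_1}\left(1-\frac{N_{11}-1}{N_1-1}\right)E\left(\frac{1}{n_1}\right) \\
&= \frac{N_{11}(N_1-N_{11})}{N_1(N_1-1)}\left[E\left(\frac{1}{n_1}\right) - \frac{1}{N_1}\right]
\end{aligned}
\tag{103}
$$

Now let

$$n_1 = np_1 + \delta$$

where

$$E(\delta) = 0 \quad\text{and}\quad E(\delta^2) = \frac{N-n}{N-1}\,np_1(1-p_1)$$

Since $\delta$ will be small as compared with $np_1$, with a probability approaching 1 as $n$ becomes large, we may write

$$\frac{1}{n_1} = \frac{1}{np_1}\left(1+\frac{\delta}{np_1}\right)^{-1} = \frac{1}{np_1}\left(1 - \frac{\delta}{np_1} + \frac{\delta^2}{n^2p_1^2} - \cdots\right)$$

Hence

$$E\left(\frac{1}{n_1}\right) \cong \frac{1}{np_1} \tag{104}$$

to a first approximation, or more precisely

$$E\left(\frac{1}{n_1}\right) \cong \frac{1}{np_1} + \frac{N-n}{N-1}\cdot\frac{1-p_1}{n^2p_1^2} \tag{105}$$

Substituting from (105) in (103), we have

$$V\left(\frac{n_{11}}{n_1}\right) \cong \frac{p_{11}}{p_1}\left(1-\frac{p_{11}}{p_1}\right)\frac{N_1}{N_1-1}\left[\frac{1}{np_1} + \frac{N-n}{N-1}\cdot\frac{1-p_1}{n^2p_1^2} - \frac{1}{Np_1}\right] \tag{106}$$

Replacing $N_1 - 1$, $N - 1$ by $N_1$, $N$ respectively, we get

$$V_1\left(\frac{n_{11}}{n_1}\right) = \frac{N-n}{N}\cdot\frac{1}{np_1}\cdot\frac{p_{11}}{p_1}\left(1-\frac{p_{11}}{p_1}\right) \tag{107}$$

where $V_1$ denotes the first approximation to the variance, and

$$V_2\left(\frac{n_{11}}{n_1}\right) = \frac{N-n}{N}\cdot\frac{1}{np_1}\cdot\frac{p_{11}}{p_1}\left(1-\frac{p_{11}}{p_1}\right)\left[1 + \frac{1}{np_1}(1-p_1)\right] \tag{108}$$

where $V_2$ denotes the second approximation to the variance.

1.19 Variance of a Function of Two Random Variables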
In the previous section, we have worked out the variance of $n_{11}/n_1$, a function of two random variables, starting from the basic definition of variance given in (33). As we shall be dealing with functions of two or more random variables very extensively, it is more convenient to use the following theorem.

Theorem

The variance of $f(x, y)$, a function of two random variables $x$ and $y$, not necessarily independent, is given by

$$V[f(x,y)] = E[V(f(x,y) \mid y)] + V[E(f(x,y) \mid y)] \tag{109}$$

Proof: By definition, we have

$$V[f(x,y)] = E[f(x,y)]^2 - \{E[f(x,y)]\}^2$$

Writing for convenience $f(x, y)$ as $f$, we have

$$
\begin{aligned}
V[f] &= E[f^2] - \{E[f]\}^2 \\
&= E[E(f^2 \mid y)] - \{E[E(f \mid y)]\}^2 \\
&= E[E(f^2 \mid y)] - E[\{E(f \mid y)\}^2] + E[\{E(f \mid y)\}^2] - \{E[E(f \mid y)]\}^2 \\
&= E[E(f^2 \mid y) - \{E(f \mid y)\}^2] + V[E(f \mid y)] \\
&= E[V(f \mid y)] + V[E(f \mid y)]
\end{aligned}
$$

Q.E.D.
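The decomposition (109) can be illustrated on a small discrete example. The Python sketch below uses a hypothetical joint distribution of $(x, y)$ and an arbitrary function $f$, both chosen only for illustration, and compares the two sides of (109) with exact fractions.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y) and an arbitrary function f(x, y).
joint = {(0, 0): Fraction(1, 10), (1, 0): Fraction(3, 10),
         (0, 1): Fraction(2, 10), (2, 1): Fraction(4, 10)}

def f(x, y):
    return x * x + 3 * y

def mean(dist, g):
    """Expected value of g(x, y) under the distribution dist."""
    return sum(pr * g(x, y) for (x, y), pr in dist.items())

def variance(dist, g):
    return mean(dist, lambda x, y: g(x, y) ** 2) - mean(dist, g) ** 2

total_var = variance(joint, f)                      # left-hand side of (109)

expected_cond_var = Fraction(0)                     # E[V(f | y)]
cond_means = []                                     # pairs (P(y), E(f | y))
for y0 in sorted({y for _, y in joint}):
    p_y = sum(pr for (_, y), pr in joint.items() if y == y0)
    cond = {xy: pr / p_y for xy, pr in joint.items() if xy[1] == y0}
    expected_cond_var += p_y * variance(cond, f)
    cond_means.append((p_y, mean(cond, f)))

overall = sum(p_y * m for p_y, m in cond_means)
var_cond_mean = sum(p_y * (m - overall) ** 2 for p_y, m in cond_means)   # V[E(f | y)]

print(total_var, expected_cond_var + var_cond_mean)
```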
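Applying this theorem to find the variance of $n_{11}/n_1$ considered in the previous section, we have

$$V\left(\frac{n_{11}}{n_1}\right) = E\left[V\left(\frac{n_{11}}{n_1}\,\Big|\,n_1\right)\right] + V\left[E\left(\frac{n_{11}}{n_1}\,\Big|\,n_1\right)\right]$$

Since $E(n_{11}/n_1 \mid n_1)$ is independent of $n_1$, the second term is clearly zero. Whence, using (72), we have

$$
\begin{aligned}
V\left(\frac{n_{11}}{n_1}\right) &= E\left[\frac{N_1-n_1}{N_1-1}\cdot\frac{N_{11}}{N_1}\cdot\frac{N_1-N_{11}}{N_1}\cdot\frac{1}{n_1}\right] \\
&= \frac{N_{11}(N_1-N_{11})}{N_1(N_1-1)}\left[E\left(\frac{1}{n_1}\right) - \frac{1}{N_1}\right]
\end{aligned}
$$

which is the same as that given in (103).

1.20 Inverse Sampling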
If the proportion $p$ is very small, which will generally be the case if the attribute under consideration is a rare one, the method of estimation of $p$ described in Section 1.14 may be unsatisfactory. Even a large sample may not provide an estimate with any reasonable degree of precision. In such cases the method of inverse sampling can be used with advantage (Haldane, 1946; Finney, 1949). In this method the sample size $n$ is not fixed in advance. Instead, sampling is continued until a predetermined number of units possessing the rare attribute have been drawn.

Let $p$ denote the proportion of units in the population possessing the rare attribute under study. Evidently, $Np$ units in the population will possess the rare attribute. To estimate the proportion $p$, the sampling units are drawn one by one with equal probability and without replacement. Sampling is discontinued as soon as the number of units in the sample possessing the rare attribute is a predetermined number $m$. Denote by $n$ the sample size required to be drawn to obtain $m$ units possessing the rare attribute. Then the corresponding probability distribution $P(n)$ of the random variable $n$ is given by

$$
\begin{aligned}
P(n) &= P\left\{\begin{array}{l}\text{in a sample of } n-1 \text{ units drawn from } N,\\ m-1 \text{ units will possess the rare attribute}\end{array}\right\}\cdot P\left\{\begin{array}{l}\text{the unit drawn at the } n\text{-th draw}\\ \text{will possess the rare attribute}\end{array}\right\} \\
&= \frac{\binom{Np}{m-1}\binom{Nq}{n-m}}{\binom{N}{n-1}}\cdot\frac{Np-m+1}{N-n+1}
\end{aligned}
\tag{110}
$$

Since the possible values of $n$ are $m, m+1, m+2, \ldots, m+Nq$, we have

$$\sum_{n=m}^{m+Nq} P(n) = 1 \tag{111}$$

We shall show that an unbiased estimate of $p$ is given by

$$\text{Est. } p = \frac{m-1}{n-1} = \hat{p}, \text{ say.} \tag{112}$$

Thus,

$$
\begin{aligned}
E\left(\frac{m-1}{n-1}\right) &= \sum_{n=m}^{m+Nq} \frac{m-1}{n-1}\cdot\frac{\binom{Np}{m-1}\binom{Nq}{n-m}}{\binom{N}{n-1}}\cdot\frac{Np-m+1}{N-n+1} \\
&= p\sum_{n=m}^{m+Nq} \frac{\binom{Np-1}{m-2}\binom{Nq}{n-m}}{\binom{N-1}{n-2}}\cdot\frac{Np-m+1}{N-n+1} \\
&= p
\end{aligned}
\tag{113}
$$

since the terms inside the summation sign may be obtained from (110) on replacing $N$ by $N-1$, $Np$ by $(Np-1)$, $m$ by $(m-1)$ and $n$ by $(n-1)$, and according to (111), the sum of these terms is evidently unity.

Now we shall estimate the variance of $\hat{p}$. By definition

$$V(\hat{p}) = E(\hat{p}^2) - p^2$$

An unbiased estimate of $V(\hat{p})$ is therefore given by

$$\text{Est. } V(\hat{p}) = \hat{p}^2 - \text{Est. }(p^2) \tag{114}$$

To obtain Est. $(p^2)$, we have

$$
\begin{aligned}
E\left[\frac{(m-1)(m-2)}{(n-1)(n-2)}\right] &= \sum_{n=m}^{m+Nq} \frac{(m-1)(m-2)}{(n-1)(n-2)}\cdot\frac{\binom{Np}{m-1}\binom{Nq}{n-m}}{\binom{N}{n-1}}\cdot\frac{Np-m+1}{N-n+1} \\
&= \frac{p(Np-1)}{N-1}\sum_{n=m}^{m+Nq} \frac{\binom{Np-2}{m-3}\binom{Nq}{n-m}}{\binom{N-2}{n-3}}\cdot\frac{Np-m+1}{N-n+1} \\
&= \frac{p(Np-1)}{N-1} = \frac{N}{N-1}\,p^2 - \frac{p}{N-1}
\end{aligned}
$$

Hence, on using (113), an unbiased estimate of $p^2$ is given by

$$\text{Est. }(p^2) = \frac{N-1}{N}\cdot\frac{(m-1)(m-2)}{(n-1)(n-2)} + \frac{1}{N}\cdot\frac{m-1}{n-1} \tag{115}$$

Substituting from (115) in (114), we have

$$\text{Est. } V(\hat{p}) = \frac{(m-1)^2}{(n-1)^2} - \frac{N-1}{N}\cdot\frac{(m-1)(m-2)}{(n-1)(n-2)} - \frac{1}{N}\cdot\frac{m-1}{n-1} \tag{116}$$
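The unbiasedness of (112) and of the variance estimate (116) can be checked by enumerating the distribution (110). A minimal Python sketch with exact fractions follows; the figures `N = 30`, `A = 6` units with the attribute and `m = 3` are hypothetical, and `A` stands for $Np$.

```python
from fractions import Fraction
from math import comb

def inverse_sampling_pmf(N, A, m):
    """P(n) of (110): A = Np units possess the attribute; stop at the m-th such unit."""
    pmf = {}
    for n in range(m, m + (N - A) + 1):
        first = Fraction(comb(A, m - 1) * comb(N - A, n - m), comb(N, n - 1))
        last = Fraction(A - m + 1, N - n + 1)
        pmf[n] = first * last
    return pmf

def p_hat(m, n):
    return Fraction(m - 1, n - 1)                                   # (112)

def est_var(N, m, n):
    return (p_hat(m, n) ** 2
            - Fraction(N - 1, N) * Fraction((m - 1) * (m - 2), (n - 1) * (n - 2))
            - Fraction(1, N) * p_hat(m, n))                         # (116)

N, A, m = 30, 6, 3        # hypothetical; m >= 3 keeps every denominator positive
p = Fraction(A, N)
pmf = inverse_sampling_pmf(N, A, m)

mean = sum(pr * p_hat(m, n) for n, pr in pmf.items())
true_var = sum(pr * p_hat(m, n) ** 2 for n, pr in pmf.items()) - p ** 2
mean_est_var = sum(pr * est_var(N, m, n) for n, pr in pmf.items())

print(sum(pmf.values()) == 1, mean == p, mean_est_var == true_var)
```

The sketch evaluates expectations over the stopping rule itself, so it checks the estimates under exactly the sampling scheme described above rather than under a fixed sample size.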
As $N$ tends to be large, the probability distribution of $n$ given by (110) approaches the well-known negative binomial distribution, namely

$$P(n) = \binom{n-1}{m-1}\,p^m q^{n-m} \tag{117}$$

It can be seen that in this case

$$\text{Est. }(p) = \hat{p} = \frac{m-1}{n-1} \tag{118}$$

while

$$\text{Est. } V(\hat{p}) = \frac{(m-1)(n-m)}{(n-1)^2(n-2)} = \frac{\hat{p}(1-\hat{p})}{n-2} \tag{119}$$

It is easy to see that the latter result can also be obtained by letting $N$ tend to infinity in (116).

1.21 Quantitative and Qualitative Characteristics
We will now extend the preceding theory to the situation involving both quantitative and qualitative variation together in the same problem. This situation is of common occurrence. Thus, in a population survey we may be required to estimate both the proportions of families in different income groups and the total income in each group. The tabulation on a sampling basis of census results presents similar problems. Suppose punched cards, each one representing the data of a different holding in an agricultural census, are available for sorting and tabulation. Further, suppose that the holdings are to be classified according to their size in five classes: 0 to 2.5, 2.5 to 5.0, 5 to 10, 10 to 25 and larger than 25 acres. We may be required to estimate proportions of holdings in the different classes and also the total area under any specified crop in each class. In all such problems, it is convenient to select a sample of $n$ out of the total of $N$ by the method of simple random sampling. We have already considered the problem of estimating the proportions in the different classes. The problems for consideration now are:

(a) to obtain an estimate of the total (or the average) of the quantitative characteristic under study in each class;

(b) to obtain the standard error of the estimates in (a); and

(c) to predict the size of the sample $n$ required for estimating the total in each class with a given standard error.

Without loss of generality, we may consider these problems in relation to an actual example.
Example 1.4

It is proposed to estimate the area benefited from irrigation wells said to have been completed under the Grow-More-Food Campaign from the data given in Example 1.3. The sample is proposed to be selected by the method of simple random sampling from the population of wells reported to have been constructed. The number of wells actually constructed is not known. How large should $n$ be in order that the area benefited may be determined with 5 per cent standard error?

Let

$N$ = number of wells reported to have been constructed under the Grow-More-Food Campaign;

$p$ = proportion of wells in the population actually constructed; for convenience we will designate these wells as belonging to class 1; and

$q$ = proportion of wells not completed. We will designate the wells under this category as falling in class 2.

Evidently,

$$Np + Nq = N$$

Let, further, $\bar{y}_{N_1}$, $S_1^2$ be respectively the population mean area benefited per well and the population mean square for class 1. Let $n_1$ denote the number of wells falling in class 1 when a random sample of $n$ is chosen by the method of simple random sampling from the population $N$, and $\bar{y}_{n_1}$ be the corresponding mean area in the sample.

Our first problem is to obtain the estimate of the total area benefited, namely, $Np\,\bar{y}_{N_1}$. Since the sample is chosen by the method of simple random sampling from the entire population, the sub-sample $n_1$ can also be considered a random sample from the corresponding population of $Np$ units. It follows that for given $n_1$, $\bar{y}_{n_1}$ will be an unbiased estimate of $\bar{y}_{N_1}$. It is, therefore, natural to take $N\dfrac{n_1}{n}\bar{y}_{n_1}$ as the estimate of the total area benefited from the completed wells. It is easy to show that this is an unbiased estimate of $Np\,\bar{y}_{N_1}$. For

$$E\left(N\frac{n_1}{n}\bar{y}_{n_1}\right) = \frac{N}{n}\,E\left[n_1 E(\bar{y}_{n_1} \mid n_1)\right] = \frac{N}{n}\,\bar{y}_{N_1}E(n_1) = \frac{N}{n}\,\bar{y}_{N_1}\,np = Np\,\bar{y}_{N_1} \tag{120}$$

To obtain the variance, we have on using (109)

$$
\begin{aligned}
V\left(N\frac{n_1}{n}\bar{y}_{n_1}\right) &= \frac{N^2}{n^2}\left\{E\left[V(n_1\bar{y}_{n_1} \mid n_1)\right] + V\left[E(n_1\bar{y}_{n_1} \mid n_1)\right]\right\} \\
&= \frac{N^2}{n^2}\left\{E\left[n_1^2\left(\frac{1}{n_1}-\frac{1}{Np}\right)S_1^2\right] + V(n_1\bar{y}_{N_1})\right\} \\
&= \frac{N^2}{n^2}\left\{\frac{Np\,E(n_1) - E(n_1^2)}{Np}\,S_1^2 + \bar{y}_{N_1}^2\,V(n_1)\right\}
\end{aligned}
$$

Substituting from (64), (68) and (72), we have on simplification

$$V\left(N\frac{n_1}{n}\bar{y}_{n_1}\right) = \frac{N(N-n)}{n(N-1)}\left\{(Np-1)\,S_1^2 + Np(1-p)\,\bar{y}_{N_1}^2\right\} \tag{123}$$

For purposes of simplification, we will assume that $Np$ is large enough to permit the following approximations:

$$\frac{Np-1}{Np} \cong 1 \quad\text{and}\quad \frac{N-1}{N} \cong 1$$

Using these, we obtain

$$V\left(N\frac{n_1}{n}\bar{y}_{n_1}\right) \cong N^2\,\frac{N-n}{Nn}\,p\left\{S_1^2 + (1-p)\,\bar{y}_{N_1}^2\right\} \tag{124}$$
SAMPLING THEORY OF 9URVEYS WITH APPLICATIONS
From (123), it follows at once that an unbiased estimate of the variance is given by Est. V
= N2
N-n /2 N
n
rvr2 n{n — 1) N{N — n)
1 2
(2fc)*l
»(«-!)
n
J
(125)
To predict the size of sample required for estimating the characteristic with a given standard error, we need the expression for the relative variance. This is obtained by dividing (124) by $N^2p^2\bar{y}_{N_1}^2$ and is given by

$$\frac{N-n}{N}\cdot\frac{1}{n}\left(\frac{C_1^2}{p} + \frac{1-p}{p}\right) \tag{126}$$

where $C_1^2$ denotes the square of the coefficient of variation of the area irrigated from a well in class 1. For $N$ large, the relative variance is given by

$$\frac{1}{n}\left(\frac{C_1^2}{p} + \frac{1-p}{p}\right) \tag{127}$$

An idea of $C_1^2$ may be formed from previous experience. Let us assume it to be 0.5. Since $n$ will need to be large as $p$ decreases, $p$ may be assumed to be the smallest of the values consistent with expectation and previous experience in order that we may err on the safe side. Table 1.3 gives values of $n$ for different values of $p$ and for $C_1^2 = 0.5$ in order that the area benefited may be estimated with 5 per cent standard error. Two sets of values of $n$ are given:

(i) those obtained from (126), for $N = 5000$, and

(ii) those from (127), i.e., after neglecting the finite multiplier.

TABLE 1.3
SAMPLE SIZE REQUIRED FOR ESTIMATING THE TOTAL IN CLASS WITH 5 PER CENT STANDARD ERROR

    p       0.5    0.6    0.7    0.8    0.9
    (i)     690    536    419    327    253
    (ii)    800    600    457    350    267

It will be seen that a sample of 690 wells will be required for estimating the area benefited with a degree of accuracy as large as the one specified or larger, assuming of course that $p$ does not fall below 0.5. Ignoring the finite multiplier altogether would imply a loss of nearly 15 per cent of the information.

We may call attention to one important point. It will be seen from (124) that the sampling units falling in a given class alone contribute to the information in that class. It follows that this formula for the sampling variance is applicable to any class, even when the population consists of several classes, $p$ in that case representing the proportion of units in the population falling in the given class, and $(1 - p)$ representing the proportion of units falling in all the remaining classes together. It also follows that the value of $n$ required for estimating the class areas with a specified accuracy or higher is the value corresponding to the smallest of the $p$ values.
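The entries of Table 1.3 can be recomputed by solving (126) and (127) for $n$ at a relative standard error of 5 per cent. The Python sketch below does this; the function names are illustrative, and rounding to the nearest integer is assumed because it reproduces the tabulated values.

```python
def n_large_population(p, c1_sq=0.5, rel_se=0.05):
    """Solve (127) for n: relative variance (1/n)(C1^2/p + (1-p)/p) = rel_se^2."""
    return (c1_sq + (1 - p)) / (p * rel_se ** 2)

def n_finite_population(p, N=5000, c1_sq=0.5, rel_se=0.05):
    """Solve (126) for n, i.e. (127) multiplied by the finite multiplier (N - n)/N."""
    a = (c1_sq + (1 - p)) / p
    return a / (rel_se ** 2 + a / N)

p_values = (0.5, 0.6, 0.7, 0.8, 0.9)
row_i = [round(n_finite_population(p)) for p in p_values]
row_ii = [round(n_large_population(p)) for p in p_values]

print("p   ", *p_values)
print("(i) ", *row_i)
print("(ii)", *row_ii)
```

Passing a different `c1_sq` or `rel_se` shows how quickly the required sample grows as the class becomes rarer or the within-class variation larger.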
EXERCISES

1.1 In simple random sampling the probability of selecting any given $n$ units in succession in a specified order is, by definition,

$$\frac{1}{N}\cdot\frac{1}{N-1}\cdots\frac{1}{N-n+1}$$

Hence, show that every one of the $\binom{N}{n}$ possible clusters of size $n$ has an equal probability of being selected.

Show further that if simple random sampling were to be defined as a method of selecting $n$ units such that every one of the $\binom{N}{n}$ possible clusters has an equal probability of being chosen, it implies that the probability of selection assigned to each available unit of the population at the first and each subsequent draw in unit-by-unit selection is equal.

1.2 The following procedure has been used for selecting a sample of fields for crop-cutting experiments on rice.

"Against the name of each selected village are shown three random numbers smaller than the highest survey number in the village. Select the survey numbers corresponding to given random numbers for experiments. If the selected survey number does not grow rice, select the next bigger rice-growing survey number in its place."

Examine whether the above method will provide an equal chance of inclusion in the sample to all the paddy-growing survey numbers in the village, given the following:

1. Name of village: Payagpur
2. Total number of survey numbers: 290
3. Random numbers: 18, 189, 239
4. Rice-growing survey numbers: 49 to 88 and 189 to 290.

Show that the survey number 49 has a chance of 49/290 of being included in the sample, the survey number 189 a chance of 101/290, while the remaining survey numbers have a chance of only 1/290 each. (I.C.A.R., 1951)

1.3 The following method is laid down for locating and marking a random plot of area 33' × 33' in a field selected for crop-cutting experiments in India.

"Stand facing North with the field in front of you and to your right. Measure the length and the breadth of the field in feet and deduct 33' from each. Select a pair of random numbers less than or equal to the remainders so obtained to locate the corner of the plot. Fix a peg at this corner, tie a string to it and stretch it along the length of the field away from the South-West corner of the field. Measure 33' along it by means of a tape and put the cross-staff at this point. Turn the string round the cross-staff and stretch it at right-angles away from the South-West corner of the field and measure 33' along it. Proceed in this manner until you reach the starting point of the plot by checking the distance between the fourth and the first corner."

A rectangular plot of area 33' × 1' is to be selected from a rectangular field of area 120' × 1' by the method described above. Show that the number $g(i)$ of plots in which the $i$-th unit area gets selected is given by

$$
g(i) = \begin{cases}
i & \text{for } 1 \le i \le 32 \\
33 & \text{for } 33 \le i \le 88 \\
121 - i & \text{for } 89 \le i \le 120
\end{cases}
$$

Obtain the corresponding results when a rectangular plot of area 33' × 33' is to be selected from a rectangular field of area 120' × 100'. Generalize the results to the case when a plot measuring a' × b' is to be selected from a field measuring L' × B'. (Aggarwal, 1969)

1.4 If $f(x, y)$ and $g(x, y)$ are functions of two random variables $x$ and $y$, not necessarily independent, show that

$$\text{Cov}[f(x,y),\,g(x,y)] = E\{\text{Cov}[f(x,y),\,g(x,y) \mid y]\} + \text{Cov}\{E[f(x,y) \mid y],\,E[g(x,y) \mid y]\}$$

1.5 Consider a simple random sample of size 2 drawn from a finite population $(y_1, y_2, y_3)$. Corresponding to the three possible samples, $s_1 = (y_1, y_2)$, $s_2 = (y_1, y_3)$ and $s_3 = (y_2, y_3)$, let a linear estimate $e(s)$ for estimating the population mean be defined as follows:

$$e(s_1) = \tfrac{1}{2}y_1 + \tfrac{1}{2}y_2$$
(Dd)fd where f —
J
(d) mnD
MN
If $M$, $N$, $m$, $n$ and $D$ all increase without limit in such a manner that $\dfrac{mnD}{MN}$ remains constant, equal to $\lambda$, show that

$$P(d) \rightarrow \frac{\lambda^d e^{-\lambda}}{d!}$$

(Deming and Glasser, 1959)

1.10 Obtain the variance of the estimate $\bar{y}_u$ considered in Exercise 1.7. Show that an unbiased estimate of the variance of $\bar{y}_u$ is given by

$$\text{Est. } V(\bar{y}_u) = \left(\frac{1}{u} - \frac{1}{N}\right)s^2$$

where $s^2$ is the sample mean square based on $u$ distinct units.
REFERENCES

1. Aggarwal, O. P. (1959) 'Bayes and minimax procedures in sampling from finite and infinite populations I', Ann. Math. Stat., 30, 206-218.
2. Aggarwal, O. P. (1969) 'On the biases in sampling a plot from a field', J. Ind. Soc. Agri. Stat. (under publication).
3. Bartlett, M. S. (1937) 'Sub-sampling for attributes', J. Roy. Stat. Soc., Suppl. 4, 131-135.
4. Deming, W. E. and Glasser, G. J. (1959) 'On the problem of matching lists by samples', J. Amer. Stat. Assoc., 54, 403-415.
5. Des Raj and Khamis, S. H. (1958) 'Some remarks on sampling with replacement', Ann. Math. Stat., 29, 550-557.
6. Erdos, P. and Renyi, A. (1959) 'On the central limit theorem for samples from a finite population', Pub. Math. Inst., Hungarian Acad. Sci., 4, 49-57.
7. Finney, D. J. (1949) 'On a method of estimating frequencies', Biometrika, 36, 233-234.
8. Fisher, R. A. (1922) 'On the mathematical foundations of theoretical statistics', Phil. Trans. Roy. Soc., London, Series A, 222, 309-368.
9. Fisher, R. A. and Yates, F. (1938) Statistical tables for biological, agricultural and medical research, Oliver and Boyd Ltd., London.
10. Godambe, V. P. (1955) 'A unified theory of sampling from finite populations', J. Roy. Stat. Soc., Series B, 17, 268-278.
11. Hajek, J. (1960) 'Limiting distributions in simple random sampling from a finite population', Pub. Math. Inst., Hungarian Acad. Sci., 5, 361-374.
12. Haldane, J. B. S. (1946) 'On a method of estimating frequencies', Biometrika, 33, 222-225.
13. Horvitz, D. G. and Thompson, D. J. (1952) 'A generalization of sampling without replacement from a finite universe', J. Amer. Stat. Assoc., 47, 663-685.
14. I.C.A.R., New Delhi (1951) Sample surveys for the estimation of yield of food crops (1944-49), Bulletin No. 72.
15. Neyman, J. (1934) 'On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection', J. Roy. Stat. Soc., 97, 558-606.
16. Rand Corporation (1955) 'A million random digits', The Free Press, Glencoe (Illinois).
17. Robson, D. S. (1957) 'Applications of multivariate polykays to the theory of unbiased ratio type estimation', J. Amer. Stat. Assoc., 52, 511-522.
18. Roy, J. and Chakravarti, I. M. (1960) 'Estimating the mean of a finite population', Ann. Math. Stat., 31, 392-398.
19. Stein, C. (1945) 'A two sample test for a linear hypothesis whose power is independent of the variance', Ann. Math. Stat., 16, 243-258.
20. Sukhatme, P. V. (1935) 'Contribution to the theory of the representative method', J. Roy. Stat. Soc., Suppl. 2, 253-268.
21. Sukhatme, P. V. (1938) 'On bi-partitional functions', Phil. Trans. Roy. Soc., London, Series A, 237, 375-409.
22. Sukhatme, P. V. (1944) 'Moments and product moments of moment-statistics for samples of the finite and infinite populations', Sankhya, 6, 363-382.
23. Tippett, L. H. C. (1927) 'Random sampling numbers', Tracts for computers, XV, Cambridge University Press.
TABLE OF RANDOM NUMBERS

[Table entries not recoverable.]
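$$= \frac{(1-P_j)\dfrac{y_i}{P_i} + (1-P_i)\dfrac{y_j}{P_j}}{N(2-P_i-P_j)} \tag{53}$$

Now, from (30), we have

$$\text{Est. }\{V[\bar{z}(s_i)]\} = \frac{1}{4N^2}(1-P_i)^2\left(\frac{y_i}{P_i}-\frac{y_j}{P_j}\right)^2 \tag{54}$$

and

$$\text{Est. }\{V[\bar{z}(s_j)]\} = \frac{1}{4N^2}(1-P_j)^2\left(\frac{y_i}{P_i}-\frac{y_j}{P_j}\right)^2 \tag{55}$$

Further, after some algebraic reduction, it can be shown that

$$\sum P'(s_i)\left[\bar{z}(s_i) - \bar{z}_u(s)\right]^2 = \frac{(1-P_i)(1-P_j)(P_i+P_j)^2}{4N^2(2-P_i-P_j)^2}\left(\frac{y_i}{P_i}-\frac{y_j}{P_j}\right)^2 \tag{56}$$

It follows on substituting these results in (46) that

$$\text{Est. }\{V[\bar{z}_u(s)]\} = \frac{(1-P_i-P_j)(1-P_i)(1-P_j)}{N^2(2-P_i-P_j)^2}\left(\frac{y_i}{P_i}-\frac{y_j}{P_j}\right)^2 \tag{57}$$

In a similar manner, unordered estimates can also be built up from the ordered estimates suggested by Das (1951), Sukhatme (1953) and others.

2.7 Horvitz-Thompson Estimate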
From the theory of unordered estimates developed above, it is clear that such estimates have limited applicability in the sense that they lack simplicity, and the expressions for the estimate as also for the estimated variance become almost unmanageable when the sample size is even moderately large. Horvitz and Thompson (1952) suggest a simpler estimate which we shall present below.

Let the population consist of $N$ units and $y_i$ as before represent the value of the characteristic under study for the $i$-th unit in the population $(i = 1, 2, \ldots, N)$. Suppose that a sample of size $n$ is drawn without replacement, using arbitrary probabilities of selection at each draw. Thus, prior to each succeeding draw there is defined a new probability distribution for the units available at that draw. The probability distribution at each draw may or may not depend upon the initial probabilities at the first draw. Define a random variable $\alpha_i$ $(i = 1, 2, \ldots, N)$ such that

$$\alpha_i = \begin{cases} 1 & \text{if the } i\text{-th unit is included in the sample} \\ 0 & \text{otherwise} \end{cases} \tag{58}$$

and let

$$z_i = \frac{n y_i}{N E(\alpha_i)}, \quad (i = 1, 2, \ldots, N) \tag{59}$$

where we assume that every unit has a positive probability of being included in the sample, i.e., $E(\alpha_i) > 0$ for all $i$.

We shall show that the simple arithmetic mean of the $z_i$ provides an unbiased estimate of the population mean $\bar{y}_N$. Making use of the definition of $\alpha_i$, we can write

$$\bar{z}_n = \frac{1}{n}\sum_{i=1}^{n} z_i = \frac{1}{n}\sum_{i=1}^{N} \alpha_i z_i \tag{60}$$

Hence

$$E(\bar{z}_n) = \frac{1}{n}\sum_{i=1}^{N} z_i\,E(\alpha_i) = \frac{1}{n}\sum_{i=1}^{N}\frac{n y_i}{N E(\alpha_i)}\,E(\alpha_i) = \bar{y}_N \tag{61}$$
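The unbiasedness shown in (61) holds for any without-replacement scheme with positive inclusion probabilities. As an illustration, the Python sketch below enumerates all ordered samples of size 2 from a population of 4 units; the $y$ values, the initial probabilities and the particular draw-by-draw scheme (second draw with probabilities proportional to the remaining initial probabilities) are assumptions made only for this example.

```python
from fractions import Fraction
from itertools import permutations

y = [Fraction(v) for v in (3, 7, 12, 20)]                          # hypothetical values
P = [Fraction(1, 10), Fraction(2, 10), Fraction(3, 10), Fraction(4, 10)]  # hypothetical
N, n = len(y), 2

# Ordered sample (i, j): unit i drawn with probability P_i, then unit j
# with probability P_j / (1 - P_i) from the units that remain.
samples = {(i, j): P[i] * P[j] / (1 - P[i]) for i, j in permutations(range(N), 2)}

# E(alpha_i): probability that unit i is included in the sample.
incl = [sum(pr for s, pr in samples.items() if i in s) for i in range(N)]

z = [n * y[i] / (N * incl[i]) for i in range(N)]                   # (59)

# Expected value of the sample mean of the z_i, as in (60) and (61).
expected = sum(pr * sum(z[i] for i in s) / n for s, pr in samples.items())
print(expected, sum(y) / N)
```

The inclusion probabilities are obtained here by enumeration; for larger populations they would have to come from the selection scheme itself.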
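To obtain the sampling variance, we have

$$V(\bar{z}_n) = E(\bar{z}_n^2) - \bar{y}_N^2 \tag{62}$$

Now, using (60), we see that

$$
\begin{aligned}
E(\bar{z}_n^2) &= \frac{1}{n^2}\,E\left(\sum_{i=1}^{N}\alpha_i z_i\right)^2 \\
&= \frac{1}{n^2}\,E\left[\sum_{i=1}^{N}\alpha_i^2 z_i^2 + \sum_{i \ne j = 1}^{N}\alpha_i\alpha_j z_i z_j\right] \\
&= \frac{1}{n^2}\left[\sum_{i=1}^{N}E(\alpha_i^2)\,z_i^2 + \sum_{i \ne j = 1}^{N}E(\alpha_i\alpha_j)\,z_i z_j\right]
\end{aligned}
\tag{63}
$$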
A? S N i=i
Pi
(2L
\JfPi
(107)
Since sampling with probabilities proportional to size is used normally in situations in which $y_i$ is approximately proportional to $P_i$, we shall investigate whether the inequality (107) is true under the following super-population model proposed by Cochran (1953). Let

$$y_i = YP_i + e_i \tag{108}$$

where $e_i$ is independent of $P_i$. In arrays in which $P_i$ is fixed, assume that

$$E(e_i) = 0; \quad E(e_i^2) = aP_i^g, \quad a > 0,\ g > 0 \tag{109}$$

Then it can be seen after some algebraic simplification that the inequality (107) holds provided

$$\text{Cov}(P_i,\,P_i^{g-1}) > 0 \tag{110}$$

which will be true if $g > 1$. In practice, because of the positive correlation that usually exists between neighbouring elements, $g$ is likely to be greater than 1. As such, systematic sampling with random ordering is likely to provide a more efficient estimate than the Rao-Hartley-Cochran procedure in many situations that occur in practice.

However, the expression for the variance in the case of systematic sampling with random ordering, given by (91), is only valid for $N$ sufficiently large. Attempts were therefore made to remove this difficulty. An exact formula for the probability of including a pair of specified units in a sample of any given size and for any $N$ has been recently given by Connor (1966).

Brewer (1963), Fellegi (1963) and Durbin (1967) have proposed new methods of selecting units with varying probabilities and without replacement such that the probability of inclusion of the $i$-th unit is proportional to $P_i$. These methods are thus similar to the one suggested by Narain (1951). Brewer's method differs from that of Narain in that he approximates the conditional probability
PJ\ i =
p* rj
Pj by 1 — P* ~J 1 - Pi
Under this assumption, it is possible to obtain exact values of the revised probabilities PJ . The feature which distinguishes Fellegi’s method from other similar methods is that the probability of the z-th unit being selected is equal to Pi at each of the n successive draws. Durbin’s method is equivalent to that of Brewer in the sense that both the methods give the same joint probabilities of inclusion. Hartley (1966) has considered systematic sampling with unequal probabilities and without replacement when the units in the popu¬ lation are listed in order of their size. With this modification, it is possible to obtain an exact expression for the variance of the estimate. However, estimation of the variance is not possible unless certain as¬ sumptions are made. Hanurav (1967) has proposed a criterion to improve the stability of the variance estimator and developed a method satisfying this criterion except when the largest size is markedly different from the next largest size. 2.12
Comparison of Different Sampling Procedures
As we have already noted, none of the procedures described above can be considered to be entirely satisfactory from the point of view of precision and applicability in practice. In fact, the non-existence of a uniformly minimum variance estimate, in the entire class of linear unbiased estimates, has been proved by Godambe (1955) and Koop (1963). More recently, Godambe and Joshi (1965) have extended these results to the entire class of unbiased estimates, removing the restriction of linearity. In what follows we shall consider some typical populations of size N = 4 to obtain an idea about the relative performance of the
various sampling procedures for samples of size 2. To facilitate discussion, appropriate estimates of the population mean corresponding to different sampling procedures for samples of size 2 are listed below.

(i) Sampling with replacement
$$T_1 = \frac{1}{2} \sum \frac{y_i}{NP_i}$$
where the summation is taken over the units drawn in the sample.

(ii) Ordered Des Raj estimate
$$T_2 = \frac{1}{2N}\left[\frac{y_i}{P_i}(1 + P_i) + \frac{y_j}{P_j}(1 - P_i)\right]$$
where $y_i$ and $y_j$ are the values of the units drawn at the first and second draw respectively.

(iii) Unordered Basu-Murthy estimate
$$T_3 = \frac{1}{N(2 - P_i - P_j)}\left[\frac{y_i}{P_i}(1 - P_j) + \frac{y_j}{P_j}(1 - P_i)\right]$$
where $y_i$ and $y_j$ are the values of the units drawn in the sample in any order.

(iv) Horvitz-Thompson estimate
$$T_4 = \frac{1}{N}\left[\frac{y_i}{P_i\left(S + 1 - \dfrac{P_i}{1 - P_i}\right)} + \frac{y_j}{P_j\left(S + 1 - \dfrac{P_j}{1 - P_j}\right)}\right]$$
where $y_i$ and $y_j$ are the values of the units drawn in the sample in any order, and $S = \sum_{i=1}^{N} \dfrac{P_i}{1 - P_i}$.

(v) Midzuno system of sampling: Horvitz-Thompson estimate
$$T_5 = \frac{N - 1}{N}\left[\frac{y_i}{(N - 2)P_i + 1} + \frac{y_j}{(N - 2)P_j + 1}\right]$$
where $y_i$ and $y_j$ are the values of the units drawn in the sample according to the Midzuno system of sampling.

(vi) Midzuno system of sampling with revised probabilities: Horvitz-Thompson estimate
$$T_6 = \frac{1}{2N}\left[\frac{y_i}{P_i} + \frac{y_j}{P_j}\right]$$
where $y_i$ and $y_j$ are the values of the units drawn in the sample according to the Midzuno system of sampling with revised probabilities of selection $P_i'$ such that
$$E(a_i) = \frac{1}{N - 1}\{(N - 2)P_i' + 1\} = 2P_i \quad (i = 1, 2, \ldots, N).$$

(vii) Narain method of sampling
$$T_7 = \frac{1}{2N}\left[\frac{y_i}{P_i} + \frac{y_j}{P_j}\right]$$
where $y_i$ and $y_j$ are the values of the units drawn in the sample with revised probabilities of selection $P_i^*$ such that
$$E(a_i) = P_i^*\left[\sum_{j=1}^{N} \frac{P_j^*}{1 - P_j^*} + 1 - \frac{P_i^*}{1 - P_i^*}\right] = 2P_i \quad (i = 1, 2, \ldots, N).$$

(viii) Systematic method of sampling
$$T_8 = \frac{1}{2N}\left[\frac{y_i}{P_i} + \frac{y_j}{P_j}\right]$$
where $y_i$ and $y_j$ are the values of the units drawn systematically with varying probabilities.

(ix) Rao-Hartley-Cochran method of sampling
$$T_9 = \frac{1}{N}\left[\frac{y_i}{P_i}\pi_1 + \frac{y_j}{P_j}\pi_2\right]$$
where $y_i$ and $y_j$ are the values of the units selected from the two groups, and $\pi_1$ and $\pi_2$ are the sums of the $P_i$ over the units of the first and second group respectively.
The three populations we shall consider are those studied by Yates and Grundy (1953) and Des Raj (1956), and are given in Table 2.2 along with their common set of selection probabilities.

Table 2.2
THREE POPULATIONS OF SIZE N = 4

Serial No.                Population A   Population B   Population C
of the unit     $P_i$        $y_i$          $y_i$          $y_i$
     1           0.1          0.5            0.8            0.2
     2           0.2          1.2            1.4            0.6
     3           0.3          2.1            1.8            0.9
     4           0.4          3.2            2.0            0.8
The variances of the sample estimates of the population mean (multiplied by $4^2$) corresponding to different sampling procedures for the three populations are given in Table 2.3.

Table 2.3
COMPARISON OF DIFFERENT SAMPLING PROCEDURES

                        Variances of estimates
Sampling Procedure   Population A   Population B   Population C
        1               0.500          0.500          0.125
        2               0.365          0.365          0.088
        3               0.312          0.312          0.070
        4               0.806          0.045          0.058
        5               2.880          0.384          0.240
        6*                -              -              -
        7               0.323          0.269          0.057
        8               0.367          0.367          0.033
        9               0.333          0.333          0.083

*This sampling procedure cannot be adopted as the initial probabilities of selection do not satisfy the condition $P_i > \dfrac{n - 1}{n(N - 1)}$.
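Some rows of Table 2.3 can be checked by listing every possible sample of size 2 together with its probability. The sketch below does this for procedures 1, 2, 3 and 9 only. The function names are illustrative and not the authors', the estimators are written for the population total (which is why the variances correspond to the tabulated values multiplied by $4^2$), and the Rao-Hartley-Cochran routine assumes N = 4 split at random into two groups of two.

```python
from itertools import combinations, permutations, product

# Selection probabilities and y-values copied from Table 2.2.
P = [0.1, 0.2, 0.3, 0.4]
POPULATIONS = {
    "A": [0.5, 1.2, 2.1, 3.2],
    "B": [0.8, 1.4, 1.8, 2.0],
    "C": [0.2, 0.6, 0.9, 0.8],
}


def mean_and_variance(dist):
    """Mean and variance of a finite distribution of (probability, value) pairs."""
    mean = sum(p * t for p, t in dist)
    var = sum(p * (t - mean) ** 2 for p, t in dist)
    return mean, var


def with_replacement(y, P):
    """Procedure 1: two independent draws with probabilities P."""
    units = range(len(y))
    return [(P[i] * P[j], 0.5 * (y[i] / P[i] + y[j] / P[j]))
            for i, j in product(units, repeat=2)]


def des_raj(y, P):
    """Procedure 2: ordered estimate, unit i drawn first and unit j second."""
    dist = []
    for i, j in permutations(range(len(y)), 2):
        prob = P[i] * P[j] / (1 - P[i])
        est = 0.5 * ((1 + P[i]) * y[i] / P[i] + (1 - P[i]) * y[j] / P[j])
        dist.append((prob, est))
    return dist


def basu_murthy(y, P):
    """Procedure 3: unordered estimate for the same draw-by-draw scheme."""
    dist = []
    for i, j in combinations(range(len(y)), 2):
        prob = P[i] * P[j] * (1 / (1 - P[i]) + 1 / (1 - P[j]))
        est = ((1 - P[j]) * y[i] / P[i] + (1 - P[i]) * y[j] / P[j]) / (2 - P[i] - P[j])
        dist.append((prob, est))
    return dist


def rao_hartley_cochran(y, P):
    """Procedure 9, assuming N = 4 split at random into two groups of two."""
    units = range(len(y))
    # Keeping unit 0 in the first group lists each split exactly once.
    splits = [(g, tuple(u for u in units if u not in g))
              for g in combinations(units, 2) if 0 in g]
    dist = []
    for g1, g2 in splits:
        pi1, pi2 = sum(P[u] for u in g1), sum(P[u] for u in g2)
        for i, j in product(g1, g2):
            prob = (P[i] / pi1) * (P[j] / pi2) / len(splits)
            est = pi1 * y[i] / P[i] + pi2 * y[j] / P[j]
            dist.append((prob, est))
    return dist


PROCEDURES = {1: with_replacement, 2: des_raj, 3: basu_murthy, 9: rao_hartley_cochran}


def variance_table():
    """Variance of each estimated total for every population and procedure."""
    return {name: {k: mean_and_variance(f(y, P))[1] for k, f in PROCEDURES.items()}
            for name, y in POPULATIONS.items()}


if __name__ == "__main__":
    for name, row in variance_table().items():
        print(name, {k: round(v, 3) for k, v in row.items()})
```

Exact enumeration is feasible here only because N = 4 and n = 2; for larger populations one would rely on the variance formulas or on simulation.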
It will be observed that in the case of population B, sampling with varying probabilities and without replacement is always more efficient than sampling with varying probabilities and with replacement, while in the case of the other two populations this is not necessarily so. Of all the estimates, the most efficient is the Basu-Murthy unordered estimate in the case of population A and the Horvitz-Thompson estimate in the case of population B, while the one based on the systematic method is the most efficient in the case of population C. It will also be seen that, as is to be expected, the Basu-Murthy unordered estimate is always more efficient than the ordered Des Raj estimate. It is, however, interesting to observe that the reduction in variance brought about by unordering is hardly appreciable. The Horvitz-Thompson estimate under the Midzuno system of sampling seems to be the least efficient estimate in the case of populations A and C, while in the case of population B it is less efficient than all the other estimates except the one based on sampling with varying probabilities and with replacement. As is to be expected, the last three procedures are always more efficient than sampling with varying probabilities and with replacement. Of these, Narain's technique seems to be generally more efficient than the other two. Of the last two procedures, it has been shown by Rao, Hartley
and Cochran that the systematic method is likely to be more efficient, especially if N is sufficiently large. The above results show that this may not be the case if N is small. In view of this, as also the difficulty of estimating the variance unless N is large and the fact that Narain's technique is not easy to apply in practice, the last procedure suggested by Rao, Hartley and Cochran may be suitable from the practical point of view. However, it can hardly be recommended from the point of view of precision. Besides, it is necessary to investigate the stability of the variance estimator before recommending any particular estimator. A large-scale study of the stabilities of estimators and their variance estimators is necessary before any definite conclusions can be drawn. Some empirical studies in this direction have recently been carried out for samples of size 2. For this, the reader is referred to Rao and Bayless (1967).
EXERCISES

2.1 If units are drawn one by one with varying probabilities and without replacement, and at any draw subsequent to the first draw, the probability of selecting a unit from the units available at that draw is proportional to the probability of selecting it at the first draw, show that for samples of size two, the Yates and Grundy estimate of variance is always positive.

2.2 Consider a sampling system where the first two units are selected with varying probabilities and without replacement, the probability of selecting a unit at the second draw being proportional to the probability of selecting it at the first draw, while the remaining (n - 2) units are selected with equal probability and without replacement from the remaining (N - 2) units. Obtain expressions for $E(a_i)$ and $E(a_i a_j)$. Hence, or otherwise, show that the Yates and Grundy estimate of variance is always positive. (Rao, 1961)

2.3 Let $x_i$ (i = 1, 2, ..., N) denote the size of the i-th unit in a population of size N. Then under the model $y_i = \beta x_i + e_i$ where $E(e_i \mid x_i) = 0$, $E(e_i e_j \mid x_i, x_j) = 0$ and $V(e_i \mid x_i) = a x_i^g$ for $a > 0$, $g \ge 0$, and if E stands for expectation over all finite populations that can be drawn from the super-population, show that, for samples of size 2, (i)
X‘E{V(T,)\ = 2aXS 2 PH~' Pj 1L- PJ~ i
.
. , 2(JV- 1) + « (n - 1) JV
£ yi Zj\ iV)
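whence an unbiased estimate of $\bar y_{N_i}^2$ is given by
$$\text{Est. } (\bar y_{N_i}^2) = \bar y_{n_i}^2 - \text{Est. } V(\bar y_{n_i}) = \bar y_{n_i}^2 - \left(\frac{1}{n_i} - \frac{1}{N_i}\right)s_i^2 \qquad (50)$$
Similarly, an unbiased estimate of $\bar y_N^2$ is given by
$$\text{Est. } (\bar y_N^2) = \bar y_w^2 - \text{Est. } V(\bar y_w) = \bar y_w^2 - \sum_{i=1}^{k} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)s_i^2 \qquad (51)$$
Hence using (49), (50), (51) and the fact that $s_i^2$ provides an unbiased estimate of $S_i^2$, an unbiased estimate of $S^2$ is given by
$$\text{Est. } S^2 = \frac{1}{N - 1}\left[\sum_{i=1}^{k}(N_i - 1)s_i^2 + N\sum_{i=1}^{k} p_i(\bar y_{n_i} - \bar y_w)^2 - N\sum_{i=1}^{k} p_i(1 - p_i)\left(\frac{1}{n_i} - \frac{1}{N_i}\right)s_i^2\right] \qquad (52)$$
Putting $N_i = Np_i$, the estimate (52) simplifies to
$$\text{Est. } S^2 = \sum_{i=1}^{k} p_i s_i^2 + \frac{N}{N - 1}\left[\sum_{i=1}^{k} p_i(\bar y_{n_i} - \bar y_w)^2 - \sum_{i=1}^{k} p_i(1 - p_i)\frac{s_i^2}{n_i}\right] \qquad (53)$$
Hence, from (48), we see that an unbiased estimate of $V(\bar y_n)_R$ based on a stratified sample is given by
$$\text{Est. } V(\bar y_n)_R = \frac{N - n}{Nn}\sum_{i=1}^{k} p_i s_i^2 + \frac{N - n}{(N - 1)n}\left[\sum_{i=1}^{k} p_i(\bar y_{n_i} - \bar y_w)^2 - \sum_{i=1}^{k} p_i(1 - p_i)\frac{s_i^2}{n_i}\right] \qquad (54)$$
An unbiased estimate of $V(\bar y_w)$ is given by (8). Hence an unbiased estimate of the reduction in variance due to stratification is given by
$$\text{Est. } [V(\bar y_n)_R - V(\bar y_w)_S] = \sum_{i=1}^{k} p_i\left(\frac{1}{n} - \frac{p_i}{n_i}\right)s_i^2 + \frac{N - n}{(N - 1)n}\left[\sum_{i=1}^{k} p_i(\bar y_{n_i} - \bar y_w)^2 - \sum_{i=1}^{k} p_i(1 - p_i)\frac{s_i^2}{n_i}\right] \qquad (55)$$
and the ratio of (55) to (8) expressed as a percentage gives an estimate of the percentage gain in efficiency due to stratification.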
These results assume a particularly simple form in the case of proportional allocation when the value of $\bar y_w$ is the same as that of the sample mean $\bar y_n$. It will be seen that, in this case, the first term of (55) becomes zero and the formula can be written as
$$\text{Est. } [V(\bar y_n)_R - V(\bar y_w)_P] = \frac{N - n}{(N - 1)n}\left[\frac{1}{n}\sum_{i=1}^{k} n_i(\bar y_{n_i} - \bar y_w)^2 - \frac{1}{n}\sum_{i=1}^{k}\left(1 - \frac{n_i}{n}\right)s_i^2\right] \qquad (56)$$
Assuming that the mean squares within different strata are equal, i.e.,
$$S_i^2 = S_w^2 \ \text{(say) for } i = 1, 2, \ldots, k \qquad (57)$$
a better estimate of $S_w^2$ can be obtained by pooling the sums of squares within strata for the sample. Thus,
$$\text{Est. } S_w^2 = s_w^2 = \frac{1}{n - k}\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{n_i})^2 \qquad (58)$$
It is easily verified that $s_w^2$ is an unbiased estimate of $S_w^2$ and hence of each $S_i^2$ when (57) holds. Hence, a better estimate of the variance of the sample mean when a stratified sample with proportional allocation is available, is given by
$$\text{Est. } V(\bar y_w)_P = \frac{N - n}{Nn}s_w^2 \qquad (59)$$
Further, let
$$\sum_{i=1}^{k} n_i(\bar y_{n_i} - \bar y_w)^2 = (k - 1)\bar n s_b^2 \qquad (60)$$
where $\bar n = n/k$. Substituting $s_w^2$ for each $s_i^2$ in (56) and using (60), we get at once
$$\text{Est. } [V(\bar y_n)_R - V(\bar y_w)_P] = \frac{(N - n)(k - 1)}{(N - 1)n^2}(\bar n s_b^2 - s_w^2) \qquad (61)$$
The estimate of the relative gain in efficiency due to stratification is thus obtained as the ratio of (61) to (59), namely,
$$\frac{\text{Est. } \{V(\bar y_n)_R - V(\bar y_w)_P\}}{\text{Est. } V(\bar y_w)_P} = \frac{N(k - 1)}{(N - 1)n}\left(\frac{\bar n s_b^2}{s_w^2} - 1\right) \qquad (62)$$
$$\simeq \frac{k - 1}{n}\left(\frac{\bar n s_b^2}{s_w^2} - 1\right) \qquad (63)$$
The quantities $s_w^2$ and $\bar n s_b^2$ are called the mean squares within and between strata respectively, and are best calculated from what is familiarly known as the analysis of variance table given below:
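Source of Variation   D.F.     Sum of Squares                                                  Mean Square
Between Strata        k - 1    $\sum_{i=1}^{k} n_i(\bar y_{n_i} - \bar y_n)^2$                 $\bar n s_b^2$
Within Strata         n - k    $\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_{n_i})^2$       $s_w^2$
Total                 n - 1    $\sum_{i=1}^{k}\sum_{j=1}^{n_i}(y_{ij} - \bar y_n)^2$           $s^2$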
The efficiency of stratification is sometimes calculated directly by comparing the overall mean square $s^2$ with $s_w^2$, the relative gain in precision (R.G.) being given by
$$\text{R.G.} = \frac{s^2}{s_w^2} - 1 = \frac{(n - k)s_w^2 + (k - 1)\bar n s_b^2}{(n - 1)s_w^2} - 1 = \frac{k - 1}{n - 1}\left(\frac{\bar n s_b^2}{s_w^2} - 1\right) \qquad (64)$$
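The analysis of variance table and the relative gain (64) are simple to compute from the sample values. The following is a minimal sketch, assuming proportional allocation and equal within-stratum mean squares as in the text; the function names and the dictionary keys are illustrative only.

```python
def anova_for_strata(strata):
    """Analysis of variance for a stratified sample.

    `strata` is a list of lists, one inner list of sample values per stratum.
    """
    k = len(strata)
    n = sum(len(s) for s in strata)
    overall_mean = sum(sum(s) for s in strata) / n
    stratum_means = [sum(s) / len(s) for s in strata]

    ss_between = sum(len(s) * (m - overall_mean) ** 2
                     for s, m in zip(strata, stratum_means))
    ss_within = sum((y - m) ** 2
                    for s, m in zip(strata, stratum_means) for y in s)

    return {
        "k": k,
        "n": n,
        "ms_between": ss_between / (k - 1),            # n-bar * s_b^2
        "ms_within": ss_within / (n - k),              # s_w^2
        "ms_total": (ss_between + ss_within) / (n - 1),  # s^2
    }


def relative_gain(strata):
    """R.G. = s^2 / s_w^2 - 1, the first form of (64)."""
    a = anova_for_strata(strata)
    return a["ms_total"] / a["ms_within"] - 1


if __name__ == "__main__":
    sample = [[1, 2, 3], [4, 5, 6], [10, 11, 12]]  # made-up data, 3 strata
    print(anova_for_strata(sample))
    print(relative_gain(sample))
```

For the made-up data in the usage lines the strata differ far more between than within, so the computed gain is large; with real data the between-strata mean square is usually much closer to the within-strata one.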
The estimated gain in precision is $\dfrac{n}{n - 1}$ times the estimate of gain given by (63) and thus not materially different in large samples. It should be remembered that these results hold only in the case of proportional allocation and when it is assumed that the mean squares in the different strata are equal. If either of these conditions is not satisfied, the ratio of (55) to (8) should be used to estimate the relative gain in efficiency.

3.8 Post-Stratification for Improving the Precision of a Simple Random Sample

Stratified sampling presupposes the knowledge of the strata sizes as well as the availability of a frame for sampling in each stratum. However, the latter is not always available. For example, the classification of a population by age is known from the census tables although the lists of persons belonging to different age groups may not be available for the selection of samples from the different age groups. Consequently, it is not possible to know in advance to which stratum a sampling unit belongs until it is contacted in the course of the survey itself. While the sample in such cases has necessarily to be selected by the method of simple random sampling, we can always classify the selected sample by the strata and treat it as if it were a stratified sample. In this section we shall examine the gain in precision arising from such post-stratification.
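If the sample is to be treated as if it were a stratified sample, then the weighted mean $\bar y_w$ would be the appropriate estimate of the population mean. This is easily seen to be an unbiased estimate of the population mean, since, for each i,
$$E(\bar y_{n_i}) = E\{E(\bar y_{n_i} \mid n_i)\} = E(\bar y_{N_i}) = \bar y_{N_i} \qquad (65)$$
Hence,
$$E(\bar y_w) = \sum_{i=1}^{k} p_i E(\bar y_{n_i}) = \sum_{i=1}^{k} p_i \bar y_{N_i} = \bar y_N \qquad (66)$$
For fixed $n_1, n_2, \ldots, n_k$, the variance of $\bar y_w$ is given by (7). Thus,
$$V(\bar y_w \mid n_1, n_2, \ldots, n_k) = \sum_{i=1}^{k} p_i^2\left(\frac{1}{n_i} - \frac{1}{N_i}\right)S_i^2 \qquad (67)$$
In order to examine the gain in precision from post-stratification, we must find the unconditional variance of $\bar y_w$ to make it comparable to the variance of $\bar y_n$, the mean of a simple random sample. Using the conditional formula for the variance given in Section 1.19,
$$V(\bar y_w) = E[V(\bar y_w \mid n_1, n_2, \ldots, n_k)] + V[E(\bar y_w \mid n_1, n_2, \ldots, n_k)] = E[V(\bar y_w \mid n_1, n_2, \ldots, n_k)] \qquad (68)$$
since $E(\bar y_w \mid n_1, \ldots, n_k)$ is a constant independent of $n_i$. From (67) and (68), we see that
$$V(\bar y_w) = \sum_{i=1}^{k} p_i^2 S_i^2\left\{E\left(\frac{1}{n_i}\right) - \frac{1}{N_i}\right\} \qquad (69)$$
An exact expression for (69) cannot be obtained. However, for large values of n and N, we may use the result (105) of Chapter I, and write
$$E\left(\frac{1}{n_i}\right) \simeq \frac{1}{np_i} + \frac{1 - p_i}{n^2 p_i^2} \qquad (70)$$
Substituting from (70) in (69), we have
$$V(\bar y_w) = \sum_{i=1}^{k}\left[\frac{1}{np_i} + \frac{1 - p_i}{n^2 p_i^2} - \frac{1}{Np_i}\right]p_i^2 S_i^2 = \left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{k} p_i S_i^2 + \frac{1}{n^2}\sum_{i=1}^{k}(1 - p_i)S_i^2 \qquad (71)$$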
The first term in (71) is the variance of the mean of a stratified sample taken with proportional allocation. The second term represents the adjustment due to post-stratification. Since the second term is small in comparison to the first when n is large, we see that post-stratification with a large sample is almost as precise as stratified sampling with proportional allocation. The result appears to be reasonable since a large sample is expected to be distributed in various strata in proportion to their sizes.
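To get a feel for the relative size of the two terms in (71), they can be evaluated separately. The sketch below is only an illustration: the function name is hypothetical, and the stratum proportions and mean squares in the usage lines are made-up numbers.

```python
def post_stratified_variance_terms(n, N, p, S2):
    """Return the two terms of (71) as a pair.

    n, N : sample and population sizes
    p    : stratum proportions p_i (summing to 1)
    S2   : within-stratum mean squares S_i^2
    """
    # Variance under stratified sampling with proportional allocation.
    proportional_term = (1 / n - 1 / N) * sum(pi * s for pi, s in zip(p, S2))
    # Adjustment due to post-stratification, of order 1/n^2.
    adjustment_term = sum((1 - pi) * s for pi, s in zip(p, S2)) / n ** 2
    return proportional_term, adjustment_term


if __name__ == "__main__":
    first, second = post_stratified_variance_terms(
        n=200, N=20000, p=[0.2, 0.3, 0.5], S2=[4.0, 9.0, 16.0])
    print(first, second, second / first)
```

With these made-up inputs the adjustment is under one per cent of the first term, which is the sense in which post-stratification with a large sample is almost as precise as proportional allocation.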
3.9 Effect of Increasing the Number of Strata on the Precision of the Estimate

The variance of the estimate of the population mean from a stratified sample depends upon
(i) the strata values of the $p_i$ and the $S_i$, and
(ii) the sample sizes $n_i$.

We shall assume that $n_i$ is proportional to $p_i$, so that the variance will now depend only on the strata values of the $p_i$ and the $S_i$, being given by
$$V(\bar y_w) = \left(\frac{1}{n} - \frac{1}{N}\right)\sum_{i=1}^{k} p_i S_i^2 \qquad (72)$$
The smaller the strata the more alike will presumably be the sampling units comprising them and the smaller, therefore, will be the values of the $S_i^2$. We may, therefore, expect that under proportional allocation the precision of the estimate of the population mean will generally increase as the number of strata increases. In the case of post-stratification, the effect of increasing the number of strata is best studied with the help of (71). The first term in this equation, it will be noticed, is identical with (72) and will presumably decrease as k increases. On the other hand, the contribution of the second term to the variance of $\bar y_w$ will increase as k increases. For N large and $S_i^2$ equal to, say, $S_w^2$ for all i, (71) may be written as
$$V(\bar y_w) \simeq \frac{1}{n}\left[S_w^2 + \frac{k - 1}{n}S_w^2\right] \qquad (73)$$
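The behaviour of (73) as k grows can be explored numerically once a rule is chosen for how $S_w^2$ falls with k. The text does not specify such a rule; the sketch below assumes, purely for illustration, that $S_w^2 = S^2(\rho^2/k^2 + 1 - \rho^2)$, where $\rho$ is a hypothetical correlation between the stratification variable and y. The function names are likewise illustrative.

```python
def variance_73(n, k, s_w2):
    """Approximate variance (73) of the post-stratified mean for large N."""
    return (s_w2 + (k - 1) / n * s_w2) / n


def within_variance(k, s2, rho):
    """ASSUMED model for S_w^2 as a function of the number of strata k.

    This is not taken from the text: it supposes that only the part of the
    variance explained by the stratification variable shrinks like 1/k^2.
    """
    return s2 * (rho ** 2 / k ** 2 + 1 - rho ** 2)


def best_number_of_strata(n, s2, rho, k_max):
    """Value of k in 1..k_max that minimizes (73) under the assumed model."""
    return min(range(1, k_max + 1),
               key=lambda k: variance_73(n, k, within_variance(k, s2, rho)))


if __name__ == "__main__":
    n, s2, rho = 50, 1.0, 0.8
    for k in range(1, 11):
        print(k, round(variance_73(n, k, within_variance(k, s2, rho)), 5))
    print("best k:", best_number_of_strata(n, s2, rho, k_max=n))
```

Under this assumed model the variance first falls and then rises again, so that only a handful of strata are worth forming for a sample of this size.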
Now, $S_w^2$ will ordinarily decrease as k increases but $(k - 1)/n$ will increase as k increases, at a rate ordinarily greater than the rate of decrease in $S_w^2$. We conclude, therefore, that, for given n, a stage in the value of k may be reached beyond which stratification may not add to the precision of the estimate. The problem of determining the optimum number of strata was considered by Dalenius (1950). On the basis of certain empirical studies he postulated that the variance of the estimated population mean in the case of stratified sampling with k strata is inversely proportional to $k^2$. Thus
$$V_k(\bar y_w) = A/k^2 \qquad (74)$$
The constant A is determined by using the value 1 for k, which gives $A = V_1(\bar y_w) = V(\bar y_n)$
Pi.
Pv (Ay—A.)2
V (ys) S -i S pij Sij2 + \ s Ay (Ay —A.)2 n
F(ys)—V(yu) =
(170)
j Pi.
1 and
y
j
(171)
1 YSAj(y.j-y..)2-f pij ^tj_^'-)2(172)
Since the first term is always non-negative and the second term is of an order higher than — j it follows that two-way stratification is more effin cient than one-way stratification. Even if the first term is zero, the loss in precision will be at the most 1
The estimation of the variance in the case of two-way stratification is much more complicated and will not be discussed here. We shall merely mention that it is possible to estimate the variance provided each row and each column has at least two observations. For details the reader is referred to Bryant (1955).

3.13 Allocation of Sample Size to Strata with Several Characteristics
In sample surveys, one is generally concerned with the problem of estimation of several population characteristics. Often it is found that these characteristics make conflicting demands on the design. A procedure that is likely to decrease the variance of the estimate of one characteristic may very well increase the variance of the estimate for some other characteristic. In this section we shall review some of the procedures mentioned in the literature to resolve this conflict when several characteristics are to be estimated simultaneously.

The problem was first considered by Neyman (1934). He pointed out that the variances of different variables are positively correlated if the variables themselves are positively correlated. When this is the case, Neyman allocation for any one characteristic will also be reasonably efficient for other characteristics. If, however, the different characteristics are not correlated, he suggested that the sample may be distributed among different strata in proportion to their sizes. Peters and Bucher (undated) suggested that in such cases the compromise allocation may be determined by maximizing the average efficiency given by
$$\frac{1}{h}\sum_h \frac{\dfrac{1}{n}\left(\sum_{i=1}^{k} p_i S_i\right)^2}{\sum_{i=1}^{k}\dfrac{p_i^2 S_i^2}{n_i}}$$
where $\sum_h$ denotes summation over the h different items under study. Another criterion suggested by Geary (1949), Cochran (1953) and others is to minimize the sum of relative variances, namely, V —
is minimum. Now can be written as
^ = iEj \
~ ^^CiTli [
Ci
(199)
It follows that $\phi$ is minimum when each of the square terms on the right-hand side of (199) is zero, or in other words
$$n_i = \frac{1}{\sqrt{\mu}}\,\frac{p_i \sigma_{iz}}{\sqrt{c_i}} \quad (i = 1, 2, \ldots, k) \qquad (200)$$
The constant of proportionality $1/\sqrt{\mu}$ is determined so as to satisfy the condition of fixed cost or fixed variance. In the former case, we substitute for $n_i$ from (200) in (198) and obtain
$$\frac{1}{\sqrt{\mu}} = \frac{c_0}{\sum_{i=1}^{k} p_i \sigma_{iz}\sqrt{c_i}} \qquad (201)$$
and get
$$n_i = \frac{p_i \sigma_{iz}}{\sqrt{c_i}} \cdot \frac{c_0}{\sum_{i=1}^{k} p_i \sigma_{iz}\sqrt{c_i}} \qquad (202)$$
which is seen to be similar in form to the corresponding result in Section 3.3. When $c_i = c$, or, in other words, when the total size of sample is fixed, we see from (200) that the optimum allocation is given by:
$$n_i = n\,\frac{p_i \sigma_{iz}}{\sum_{i=1}^{k} p_i \sigma_{iz}} \qquad (203)$$
For the alternative approach in which the cost is minimized for given precision, the reader may verify that the optimum value of $n_i$ is given by
$$n_i = \frac{p_i \sigma_{iz}}{\sqrt{c_i}} \cdot \frac{\sum_{i=1}^{k} p_i \sigma_{iz}\sqrt{c_i}}{V_0} \qquad (204)$$
where $V_0$ is the value of the variance with which it is desired to estimate the mean. We notice that the optimum allocation is governed by the same considerations as those mentioned in Section 3.3 for simple random sampling.

3.18 Variance of the Estimate under (i) Optimum Allocation, and (ii) Proportional Allocation when the Total Size of Sample is Fixed
For n fixed, the optimum allocation is given by (203). Substituting for $n_i$ from (203) in (194), we get
$$V(\bar z_w)_N = \frac{1}{n}\left\{\sum_{i=1}^{k} p_i \sigma_{iz}\right\}^2 \qquad (205)$$
For proportional allocation we substitute $n_i = np_i$ in (194) and obtain
$$V(\bar z_w)_P = \frac{1}{n}\sum_{i=1}^{k} p_i \sigma_{iz}^2 \qquad (206)$$
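The allocation (203) and the two variances (205) and (206) can be compared numerically. The sketch below assumes that the variance formula (194) has the form $\sum p_i^2 \sigma_{iz}^2 / n_i$, which is what (205) and (206) imply but which is not shown in this part of the text; the function names and the stratum values in the usage lines are illustrative only.

```python
def neyman_allocation(n, p, sigma):
    """Allocation (203): n_i proportional to p_i * sigma_i (not rounded)."""
    total = sum(pi * si for pi, si in zip(p, sigma))
    return [n * pi * si / total for pi, si in zip(p, sigma)]


def variance_for_allocation(p, sigma, alloc):
    """ASSUMED form of (194): sum of p_i^2 sigma_i^2 / n_i."""
    return sum(pi ** 2 * si ** 2 / ni for pi, si, ni in zip(p, sigma, alloc))


def variance_neyman(n, p, sigma):
    """Equation (205)."""
    return sum(pi * si for pi, si in zip(p, sigma)) ** 2 / n


def variance_proportional(n, p, sigma):
    """Equation (206)."""
    return sum(pi * si ** 2 for pi, si in zip(p, sigma)) / n


if __name__ == "__main__":
    n, p, sigma = 100, [0.5, 0.3, 0.2], [2.0, 4.0, 10.0]  # made-up strata
    print(neyman_allocation(n, p, sigma))
    print(variance_neyman(n, p, sigma), variance_proportional(n, p, sigma))
```

The difference between the two variances equals $\frac{1}{n}\sum p_i(\sigma_{iz} - \sigma_{wz})^2$ with $\sigma_{wz} = \sum p_i \sigma_{iz}$, so proportional allocation loses little when the stratum standard deviations are nearly equal.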
Now (206) can be expressed as
p(^)p=~r{
£
nLVi^i
ppi\ +
J
£
i=l
pi{*(£ ?==
Pio j— 1
i pl. 7' 2
l
v/> (£Lz.
i= 1 Pio
i=l
' ■*
_
to a
Y
\*io
(219)
'
Hence, from (210), (214) and (219) we get
F(4„) - Ffe) = j;
-i) + i j,^(| *■ - *•)’
The above formula gives the decrease in variance due to stratification. However, while the second term in the expression (220) is always positive (or at the worst zero), the first term may be positive, zero or negative. It is definitely non-negative in the case of optimum allocation given by (203), for in this case it reduces to $\dfrac{1}{n}\sum_{i=1}^{k} P_{i0}\left(p_i \sigma_{iz}/P_{i0} - \sigma_{wz}\right)^2$, and is zero when either the allocation is proportional ($n_i = nP_{i0}$) or
$$n_i \propto p_i^2 \sigma_{iz}^2 / P_{i0} \qquad (221)$$
Thus the efficiency of a stratified sample would decrease as the allocation departs from the optimum. The variance of a stratified sample is obviously the least when the allocation is optimum (Neyman allocation) and would increase as the allocation departs from the optimum. Formula (220) shows that stratification does reduce the variance when the allocation is either proportional or according to (221), but it is easy to visualize cases where the allocation may be so poor that the first term in (220) will not only be negative but larger in magnitude than the second term, making stratified sampling less efficient than unstratified sampling. For the special case when $P_{i0} = p_i$, (220) takes the form
$$\left[V(\bar z_n) - V(\bar z_w)\right]_{P_{i0} = p_i} = \sum_{i=1}^{k} p_i^2 \sigma_{iz}^2\left(\frac{1}{np_i} - \frac{1}{n_i}\right) + \frac{1}{n}\sum_{i=1}^{k} p_i(\bar z_{i.} - \bar z_{..})^2 \qquad (222)$$
and with Neyman allocation, the reduction in variance is given by
$$\left[V(\bar z_n) - V(\bar z_w)_N\right]_{P_{i0} = p_i} = \frac{1}{n}\sum_{i=1}^{k} p_i(\sigma_{iz} - \sigma_{wz})^2 + \frac{1}{n}\sum_{i=1}^{k} p_i(\bar z_{i.} - \bar z_{..})^2 \qquad (223)$$
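A small numerical case shows how a poor allocation can make the first term of (222) negative enough to outweigh the second. The sketch below evaluates the two terms of (222) for a given allocation; it assumes the special case $P_{i0} = p_i$, and the function name and the two-stratum figures are made up for illustration.

```python
def reduction_terms(n, p, sigma, means, alloc):
    """Return the two terms of (222) for the case P_i0 = p_i.

    p, sigma, means : stratum proportions, standard deviations and means
    alloc           : sample sizes n_i allotted to the strata
    """
    overall_mean = sum(pi * m for pi, m in zip(p, means))
    allocation_term = sum(pi ** 2 * si ** 2 * (1 / (n * pi) - 1 / ni)
                          for pi, si, ni in zip(p, sigma, alloc))
    between_term = sum(pi * (m - overall_mean) ** 2
                       for pi, m in zip(p, means)) / n
    return allocation_term, between_term


if __name__ == "__main__":
    n, p = 100, [0.5, 0.5]
    sigma, means = [1.0, 10.0], [10.0, 12.0]  # made-up strata
    sigma_w = sum(pi * si for pi, si in zip(p, sigma))
    allocations = {
        "proportional": [n * pi for pi in p],
        "Neyman": [n * pi * si / sigma_w for pi, si in zip(p, sigma)],
        "poor": [95, 5],  # nearly all units in the less variable stratum
    }
    for label, alloc in allocations.items():
        a, b = reduction_terms(n, p, sigma, means, alloc)
        print(label, round(a, 4), round(b, 4), round(a + b, 4))
```

In the made-up case the third allocation puts almost the whole sample in the stratum with the smaller standard deviation, and the total reduction becomes negative, i.e. the stratified sample is less efficient than an unstratified one.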
3.20 Estimation of the Change in Variance due to Stratification

In this section we shall estimate the reduction in variance of the estimated mean due to stratification when a stratified sample is available. As (220) can be written in the form:
v (z„)
-
i
V (lw) = i— 1 p?