English Pages 213 [221] Year 1965
SAMPLING METHODS AND CENSUSES
by S. S. ZARKOVICH
Chief, Methodology Branch Statistics Division, FAO
FOOD AND AGRICULTURE ORGANIZATION OF THE UNITED NATIONS Rome, Italy
First printing 1965 Second printing 1969 Third printing 1975
© FAO 1965
Printed in Italy
PREFACE
This book has developed from a series of lectures given by the author at various Training Centers and Seminars organized by FAO as part of the preparations for the 1960 World Census of Agriculture. The main aim of these lectures was to promote the application of more recent achievements in the field of survey techniques, principally those involving sampling methods. The application of sampling methods enabled many countries to collect census information for the first time in their history and for others it was a way of achieving substantial savings and other improvements.

The first draft of this book appeared in 1957 under the title Sampling methods and censuses, Volume I. Two other drafts have since been prepared in order to incorporate the experiences from further surveys and censuses. However, the essential character of the book remains unchanged.

The last decade has seen much progress in the use of sampling methods. It is believed, nevertheless, that interest in a book of this kind has not diminished and that it will be useful to many countries in organizing their 1970 agricultural census.

P.V. SUKHATME
Director, Statistics Division, FAO
CONTENTS

PREFACE

1. MAIN USES OF SAMPLING METHODS IN CENSUS WORK
   Sample censuses
   Auxiliary sample censuses
   Broadening the scope of census programs
   Sampling methods in tabulation work
   Censuses and current statistics
   Change surveys
   Censuses and statistical research
   Sampling methods and quality of census data

2. SAMPLE CENSUS OR A COMPLETE ENUMERATION CENSUS
   Advantages of a complete count
   Prerequisites of a complete enumeration census
   Needs
   Quality of data

3. SOME COMMENTS ON SAMPLE CENSUSES
   Advantages of sample censuses
   Sampling units
   Size of the sample
   General strategy in preparing sample censuses

4. AUXILIARY SAMPLE CENSUSES
   Definition
   Illustrations
   Designing auxiliary sample surveys

5. BROADENING THE SCOPE OF CENSUS PROGRAMS
   The problem
   Some illustrations
   Classification of techniques illustrated
   Selection biases
   Choice of technique for selection of the sample

6. THE USE OF SAMPLING METHODS IN TABULATION
   Advance estimates
   Illustrations
   Broadening the scope of tabulation programs
   Integrating the various aims
   Presenting sample results

7. ADJUSTING SAMPLE RESULTS
   Types of adjustment and their purposes
   Choosing the type of adjustment

8. SAMPLING ERRORS
   Computation of sampling errors
   Presenting sampling errors

9. CENSUSES AND SUBSEQUENT SURVEY WORK
   Changes in time of some characteristics
   Change surveys
   Designing change surveys
   Censuses and current statistics
   Censuses and research programs

10. CONCLUDING REMARKS

APPENDIXES
   1. United States Census of Agriculture, 1954
   2. Census of Population in Great Britain, 1951
      A. Ages (individual years) by marital condition
      B. Ages (22 categories) by marital condition
      C. Ages (18 categories) by marital condition
      D. Ages (8 categories) by marital condition

AUTHOR INDEX

SUBJECT INDEX
1. MAIN USES OF SAMPLING METHODS IN CENSUS WORK ¹

¹ This chapter is based on S.S. Zarkovich, Sampling methods and censuses, Monthly Bulletin of Agricultural Economics and Statistics, Vol. 6, No. 1, 1957, Rome, FAO.

One of the most striking features of recent developments in statistics is the rapid growth of interest in sampling methods, affecting both statistical theory and practice. The sampling approach has created an enormous number of problems in statistical theory. In the field of applications, the use of sampling methods has resulted in new techniques which have substantially increased the possibilities of modern statistical practice. Censuses are an example in which the use of sampling has given rise to a number of far-reaching innovations. In this chapter a general review will be made of the basic advantages of adopting sampling methods in census work.

Sample censuses

Complete enumeration censuses presuppose the existence of a certain minimum of facilities, such as funds, professional personnel for planning census methodology and supervising field operations, sufficiently qualified enumerators, mapping material, machine tabulation equipment, etc. These facilities, or combinations of them, are not always to be had, with the result that a census is impossible. This is particularly the case in less developed countries.

In such a situation, where the facilities for a complete enumeration census are not available, the development of the theory and practice of sample surveys now offers a practical alternative in the form of "sample censuses." In contrast to complete enumeration censuses, which involve obtaining data on census items from all the units of the population, the term sample census is used here to describe the procedure by which the information required on a usual census program is obtained only from a sample of units belonging to the population to be enumerated. Sample censuses, therefore, are essentially sample surveys. Only those sample surveys, however, which are intended to give broad information on a large number of items usually met with in census programs will be called sample censuses. All the other surveys which cover specialized fields, such as areas, yields, livestock, etc., will be called "sample surveys."

Needless to say, sample censuses rely on the same sort of facilities as complete enumeration censuses. Because they require these facilities only to a limited extent, however, recourse to sampling methods makes it possible to obtain census information in a number of cases where a complete enumeration census cannot be contemplated.

The consequences of applying one or the other of these two methods must be made clear. If a complete enumeration census is taken, data can be tabulated for any administrative unit, irrespective of its size. In those countries which traditionally take complete censuses, totals for census items are often prepared by units as small as villages or communes. If a sample census is taken, however, sufficiently precise estimates will not be possible for such small units. This is at least true of some characteristics. For this reason, sample censuses can be efficiently used only in those cases where information is needed with reference to relatively large provinces.

In some fields and under some conditions, neither a complete enumeration census nor a sample survey would provide the statistical information required. Such is the case with agricultural statistics in eastern European countries. The source of data on agriculture in these countries is current statistics, which often have a broader program than the one normally used in censuses.

However, if the alternatives are complete enumeration or a sample census, the view here adopted is that the former should be considered a worthwhile undertaking in any country and should be taken from time to time, say once every ten years. The census is a national historical document which contains a large variety of data on various aspects of a country's social and economic life. If properly tabulated by small enough units, it will show to later generations the condition of a country at a particular point of time, the course of subsequent developments, and so on. It thus represents a rich treasury of information which can be consulted whenever the need arises to examine some phenomenon in the light of history.

In the past, sample censuses have mostly been taken where difficulties over funds, personnel, transport and the like have precluded a complete count. They are also considered a logical alternative if circumstances are such that data with acceptable accuracy cannot be obtained unless special precautionary measures are taken during the field work. For example, sufficiently accurate data on areas, yields, animal production, etc., can often be obtained only if measurements are used. In such cases complete enumeration censuses may not be worthwhile. It may be that a large enough number of enumerators cannot be trained to perform the measurement operations satisfactorily. In addition, there may be difficulties of time, cost and equipment for such large-scale measurements. In situations of this nature, a sample census is the only rational way of collecting data, in spite of the preference, in principle, for complete censuses.

The same conclusion may be reached if the conditions for census taking are such that extreme methodological flexibility is needed in dealing with various parts of the program. Again, some items may require the use of measurements; others may require a special skill in eliciting accurate and complete information. A further group of items may take a long time because of subquestions supplemented by various checks on the data obtained, etc. A complete enumeration census normally provides for a fairly rigid procedure and therefore it may not be a suitable way of collecting data in a situation of this type. On the other hand, sample censuses can easily be made flexible enough to fit a large variety of circumstances. For example, it is possible to split up the collection of data into several surveys taken at different periods of the year and devoted to various parts of the program. The census information is then obtained by assembling the results of all the separate surveys.

Thus it will be seen that in those parts of the world where statistical work is not advanced and where conditions for collecting data are difficult, sample censuses are the logical alternative to complete enumeration censuses.
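The remark that a sample census cannot give sufficiently precise estimates for small units can be made concrete with a short calculation. The sketch below is an illustration only: it assumes simple random sampling without replacement, and the function name, the 5 percent sampling fraction, the area sizes and the unit-level coefficient of variation are all hypothetical figures, not taken from the text.

```python
import math

def relative_standard_error(pop_size, sampling_fraction, cv_unit):
    """Relative standard error of an estimated total under simple random
    sampling without replacement (an assumed design, for illustration).

    pop_size          -- number of holdings in the area (N)
    sampling_fraction -- n / N
    cv_unit           -- coefficient of variation of the item among holdings
    """
    n = max(1, round(pop_size * sampling_fraction))
    fpc = 1.0 - n / pop_size          # finite population correction
    return cv_unit * math.sqrt(fpc / n)

# Hypothetical figures: the same 5 percent sample in a province and in a village.
province = relative_standard_error(40_000, 0.05, 1.0)
village = relative_standard_error(200, 0.05, 1.0)
print(f"province: {province:.1%}, village: {village:.1%}")
```

Under these assumed figures the provincial total is estimated with a relative error of about 2 percent, while the village total carries an error of about 30 percent, which is why tabulation by villages or communes calls for complete enumeration.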
Auxiliary sample censuses

In defining the population of holdings to be enumerated in agricultural censuses, the usual practice is to state the qualifying limits in terms of area or of the value of produce. All the holdings below these limits are excluded from the population and left out of enumeration for a number of reasons. One such reason may be that these holdings contribute very little to the total of census characteristics and therefore can be neglected without seriously affecting the magnitude of totals. Another may be the advantage of avoiding the complications involved in identifying small holdings, which usually depend more on other branches of industry than on agriculture.

If for any reason the holdings below a given limit cannot be neglected, then an adequate procedure may be to take a complete enumeration census of holdings which belong to the population according to the definition adopted, and to obtain data on holdings outside the population by means of a sample census. In the terminology used here such sample surveys are called "auxiliary sample censuses."

In this combination of the two methods, complete enumeration plays a dominant role, because it refers to that group of holdings which are responsible for the basic part of agricultural production. It is also assumed here that the existing needs for data on small holdings do not justify their complete enumeration with all the associated difficulties. In such a case, the information needed will usually be reduced to some census items only. Questions on drainage and irrigation, agricultural machinery, quantities of the various crops sold, prices received, etc., customarily asked in complete enumeration censuses, will certainly be of little interest in the case of small holdings. The whole census program may thus easily be restricted to a few items, such as area, characteristics of people living on the holdings, livestock, number of days worked away from the holdings, etc. This information has a complementary character with respect to the complete enumeration census, and as such, it can be secured economically and with all the necessary precision through an auxiliary sample census.

The same technique can also be used in a situation where the lack of facilities, such as funds or sufficiently qualified enumerators, does not permit the taking of a complete enumeration census in its usual form. In this case, a convenient solution may be the complete enumeration of a relatively small number of large holdings and an auxiliary sample census of small holdings. Needs and conditions will determine whether the program of the latter is the same as the complete enumeration census program or whether it differs from it. The use of an auxiliary sample census will only be justified if the facilities used in the part of the population covered by the sample census do not represent more than a small fraction of what is required for a corresponding complete enumeration census. This will for the most part be the case if the size distribution of holdings is such that the number of small holdings is high as compared with that of large holdings.
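The combination described above, a complete count of large holdings plus a sample of small ones, leads to a simple estimate of the overall total. The following is a minimal sketch under stated assumptions: the small holdings are taken to be sampled by simple random sampling without replacement, and the function name and all figures are hypothetical.

```python
import math
from statistics import mean, variance  # variance() is the sample variance

def combined_total(large_values, small_sample, n_small_holdings):
    """Total for all holdings: exact sum for the completely enumerated large
    holdings plus an expanded sample mean for the small holdings.
    Returns (estimated total, standard error); only the sampled part
    contributes sampling error."""
    enumerated = sum(large_values)
    n = len(small_sample)
    estimated_small = n_small_holdings * mean(small_sample)
    fpc = 1 - n / n_small_holdings
    se = n_small_holdings * math.sqrt(fpc * variance(small_sample) / n)
    return enumerated + estimated_small, se

# Hypothetical data: areas of three large holdings and a sample of four
# small holdings drawn from 1,000 small holdings.
total, se = combined_total([120.0, 80.0, 200.0], [1.0, 2.0, 3.0, 2.0], 1000)
print(total, round(se, 1))
```

A real auxiliary sample would of course be far larger than four holdings; the tiny sample only keeps the arithmetic easy to follow.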
Broadening the scope of census programs

If the number of items on the census program is large, the whole census operation usually involves high costs and often a number of other complications. The high costs in such cases are partly due to the amount of processing needed and partly to the relatively large team of enumerators and the time required for collecting information. Considerable expenditure under these heads is to be expected, particularly in countries where labor costs are high. The complications resulting from a large number of items on the census program are primarily connected with enumerators, as it is not easy to train a large number of people to collect sufficiently accurate data on a large number of questions.

A convenient method of reducing costs and complications in such circumstances may be provided by a combination of a complete enumeration census and a sample survey. The former method is used in this combination to obtain information on a number of basic items and the sample survey for the remainder of the program. In other words, the complete enumeration is reserved for items to be tabulated by small administrative units, while the sample survey program is composed of items for which estimates by larger regions are sufficient. If the salaries paid to the statistics personnel are high, the procedure described may result in substantial savings or, for the same budget, in a broader census program. That is why, in recent years, this combination has helped toward increasing census programs to meet the growing needs for statistical data.

It should be added, however, that this combination does not cease to have its value even where the salaries of field personnel are nominal or where the enumerators are voluntary workers. The reason for this is the fact that the reduction of the complete enumeration census program may also lead to a reduction in the number of enumerators needed. If so, it makes possible a better selection of enumerators and, consequently, more accurate results.

This approach becomes especially attractive if the census is used as an opportunity to get information on a number of more complex problems, such as income, output of various livestock products, consumption of foodstuffs, etc. Acceptable accuracy of data on these and similar items requires highly trained enumerators with experience in collecting data of this type, so that they can check the consistency of answers obtained, compare the related questions and, finally, elicit better and more correct information by requesting explanations and asking additional questions. The work with such an experienced staff of enumerators can hardly be accomplished without recourse to sample surveys. Accordingly, a combination of a complete enumeration census and a sample survey may be needed whenever the census program is composed of certain items which require particular skill on the part of enumerators.

Sampling methods in tabulation work

In many countries, the regular complete tabulation of all the items on the census program, even if standard equipment for a mechanical tabulation is available, takes a long time. If such is the case, urgent needs for statistical information are left unsatisfied and the practical usefulness of the census is seriously diminished. In such a situation, a practical measure may be to take a sample of units enumerated and tabulate data belonging to these units only. By so doing, estimates are obtained of basic census results which are required to serve the most urgent needs before data of the complete tabulation are made available. In other words, the use of sampling methods for the preparation of advance estimates considerably reduces the delay between the enumeration and the first release of census results.

Another field in which sampling methods have been successfully applied is the broadening of the scope of tabulation programs. An aspect of the increased interest nowadays in statistical data is the necessity for more detailed tabulations either for various decisions on the part of the government or for private, commercial and scientific uses. These needs often call for cross-classification tables, always a delicate problem because of the costs involved. Cost considerations make it necessary to reduce to a minimum the number of accepted requirements for various classifications and cross-classifications in the complete tabulation program. However, with sampling methods, it becomes possible to reduce considerably the volume of work and broaden the tabulation program at a fraction of the cost of the corresponding complete tabulation.

In tabulation work especially, sampling methods offer substantial possibilities for savings. Normally, only part of the existing needs calls for complete tabulation. A large number of other needs can often be satisfactorily met by sample estimates, notably in the case of data related to larger administrative units and intended to serve scientific and analytical purposes. During the preparation of tabulation programs, therefore, it is advisable to determine that part of the total program which can be met by sample tabulation. If, then, a combined tabulation is adopted, either the total tabulation costs for the same program will be reduced or the program broadened on the same budget.

There might also be some interest in cases where complete enumeration censuses were tabulated exclusively by means of sampling methods. This is exceptional, however, since a complete enumeration census is generally and logically followed by a complete tabulation of at least a part of the items on the program. In some instances, this normal procedure has not been possible. Recourse to sample tabulation may then offer the only means of obtaining necessary tables from the collected data.
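The idea of advance estimates from a sample of enumerated units can be sketched as follows. This is an illustration only: it assumes the questionnaires are held in a list and that a systematic sample (every k-th questionnaire from a random start) is tabulated; the text does not prescribe this particular selection method, and the function name and data are hypothetical.

```python
import random

def advance_estimate(records, interval, seed=None):
    """Tabulate every `interval`-th record from a random start and expand
    the sample sum by the sampling interval to estimate the full total.
    Returns (estimated total, number of records tabulated)."""
    rng = random.Random(seed)
    start = rng.randrange(interval)      # random start in 0 .. interval-1
    sample = records[start::interval]
    return interval * sum(sample), len(sample)

# Hypothetical census item recorded on 1,000 questionnaires.
records = list(range(1000))
estimate, tabulated = advance_estimate(records, interval=20, seed=1)
print(estimate, tabulated, sum(records))
```

Only 50 of the 1,000 records are tabulated here, which is the source of the time saving; the price is a sampling error that the complete tabulation later removes.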
Censuses and current statistics

Information on changes in some characteristics is needed on a current basis. Typical examples of such characteristics are yields and areas harvested. This need is satisfied by current statistics.

Censuses represent a convenient moment to look into current statistics. If current statistics are not established, the census might be a good opportunity to establish them. Census data will offer considerable material for designing efficient surveys. For example, efficient labor force surveys can be established by using data from the population census for purposes such as definition of the hierarchy of sampling units, stratification, equalization of strata, estimation of components of variation, selection of units with unequal probabilities, use of census data as supplementary information, etc. Without census data, tremendous resources may be necessary to conduct much less efficient surveys.

In other cases, some broadening of census work might be planned in order to get the necessary material for the establishment of current statistics. For example, if data from a population census are planned to be published by villages, it may be advisable to have the same information prepared and available by enumeration districts, as these might prove to be very convenient units in current labor force surveys.

It could also be pointed out that censuses provide the ground for checking the quality of data collected in current statistics, for improving their accuracy, or for extending their program to new characteristics. In any of these directions, the census information may be useful. It is therefore a matter of rational planning to see in what way the preparations for a census could be used to improve current statistics.
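One way census data can serve the design of a current survey is in dividing a fixed sample among strata. The sketch below uses Neyman allocation, a standard technique chosen here purely for illustration and not named in the text; it assumes the census supplies, for each stratum, the number of units and the standard deviation of the item of interest. The stratum names and figures are hypothetical.

```python
def neyman_allocation(strata, total_sample):
    """Allocate a fixed total sample among strata in proportion to
    (stratum size x stratum standard deviation), both taken from the census.

    strata -- dict mapping stratum name to (size, standard deviation)
    Returns a dict of unrounded sample sizes per stratum."""
    weights = {name: size * sd for name, (size, sd) in strata.items()}
    denominator = sum(weights.values())
    return {name: total_sample * w / denominator for name, w in weights.items()}

# Hypothetical census figures for two strata.
allocation = neyman_allocation({"urban": (5000, 4.0), "rural": (15000, 2.0)}, 500)
print(allocation)
```

Although the urban stratum holds only a quarter of the units in this example, its larger variability earns it two fifths of the sample; without census figures such a division would have to rest on guesswork.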
Change surveys

Since the census figures refer to the state of the characteristics included in the program as on the census day, they may have but little practical value as a source of information on those items which change considerably in the course of time. Examples of such changes are the seasonal variations in the labor force, changes in the number of livestock, rate of yield, etc. In such circumstances, information on changes is required to supplement the regular census figures.

An efficient way of satisfying such needs may be found in combining the complete enumeration census with one or more sample surveys aimed at the estimation of the changes which have occurred. For example, if the census is taken in the winter, it can be planned so that a sample survey will be taken the following summer, with a view to estimating the change in labor force characteristics. Such sample surveys are here called "change surveys."

Change surveys will not be appropriate if current information on changes is required. They are a convenient method, however, if requirements can be satisfied by information on changes supplied over longer periods of time. Such a situation arises in providing the information on changes in characteristics which are usually not covered by current statistics.

The use of censuses as an occasion for collecting such information on changes is advantageous because a good deal of preparatory work for the census can be usefully exploited in designing the sample survey. Moreover, the use of census results makes it possible to prepare an efficient sample design. The cost of the information on changes may thus be reduced considerably. It will further increase the usefulness of both census and survey data because a number of various additional analyses become a practical possibility.
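One common way of exploiting census results in a change survey is a ratio estimate: for each sampled unit the census value and the later survey value are both known, and their ratio is applied to the census total. The sketch below is an assumption-laden illustration rather than a method stated in the text; the function name and the figures are hypothetical.

```python
def ratio_estimate_of_current_total(census_total, sample_pairs):
    """Estimate the current total from a change survey.

    census_total -- total of the item from the complete enumeration
    sample_pairs -- (census value, survey value) for each sampled unit
    Returns (estimated current total, estimated ratio of change)."""
    census_sum = sum(x for x, _ in sample_pairs)
    survey_sum = sum(y for _, y in sample_pairs)
    ratio = survey_sum / census_sum
    return census_total * ratio, ratio

# Hypothetical figures: winter census total of 10,000 workers and three
# sampled units revisited the following summer.
current_total, change_ratio = ratio_estimate_of_current_total(
    10_000, [(10, 12), (20, 22), (30, 38)]
)
print(round(current_total), round(change_ratio, 2))
```

The estimate gains precision when the census and survey values of a unit are closely related, which is one reason a census makes an efficient design for the change survey possible.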
Censuses and statistical research

Great emphasis is being placed nowadays on efficient planning of statistical operations. In census work, efficiency is primarily concerned with costs. In censuses a large number of people are employed; censuses require long and costly preparations; the cost of tabulation equipment and printing is also considerable. A rational approach in the planning of census operations thus means, first and foremost, a choice of methods and techniques which are expected to lead to the established census goals with a minimum total cost. This approach becomes particularly important in modern censuses which are putting ever-increasing pressure on national budgets owing to the large public demand for broadening both census and tabulation programs. Besides, no planning can be satisfactory if the quality of data is neglected. From this point of view, the aim of modern planning is to choose the technique that leads to the highest accuracy of data for a given budget. Another aspect of statistical surveys that needs to be taken into account is the administrative convenience of the designs proposed.

It is obvious that there is no rational planning without a large number of data bearing on various aspects of methods and questions. These data are often taken from past work. However, in a large number of cases, they have to be collected again because either the conditions are different or there has been no similar survey in the past. In these cases the use of sampling methods becomes essential. Through small-scale pilot surveys, it is possible to collect, at a relatively small cost, any information that may be needed in planning statistical surveys. In fact, this is how a great amount of earlier guesswork was eliminated and replaced by decisions taken on an objective basis.

The collection and analysis of data needed for rational planning of censuses and sample surveys makes up the research program of the agency responsible for data collection. If the statistical development has reached even a moderate stage, such a research program becomes a rational investment.

The question of a research program is raised in connection with censuses for several reasons. Censuses involve considerable budgets and it may not be difficult to find some means to divert a part of the resources toward research. An independent effort to provide funds may easily be a failure. On the other hand, censuses always contain a vast variety of information that can be used in subsequent survey work. With a research program established, this information will be analyzed properly and utilized as the chance arises. Without a research program established, it will be either lost or a fresh and uneconomical search for it will be made each time a new survey is being considered. Finally, if a research program is established at the start of the preparations for a census, it can also be useful for the census that is being prepared.
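As an illustration of how pilot-survey information replaces guesswork, consider the choice of sample size for estimating a mean. The formula below is the usual one for simple random sampling and is offered as a sketch only; the symbols are assumptions introduced here, not notation from the text: S is the standard deviation observed in the pilot survey, d the tolerated margin of error, t the multiplier for the chosen confidence level, and N the number of units in the population.

```latex
n_0 = \left(\frac{t\,S}{d}\right)^{2},
\qquad
n = \frac{n_0}{1 + n_0/N}
% Worked example with hypothetical figures:
% S = 5, d = 0.5, t = 2, N = 10\,000
% n_0 = (2 \cdot 5 / 0.5)^2 = 400, \quad n = 400 / 1.04 \approx 385
```

With these hypothetical figures, a pilot survey showing S = 5 suggests that about 385 units would be enough; if the pilot had shown S = 10, the requirement would roughly quadruple, which is the kind of decision that cannot be taken objectively without such data.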
Sampling methods and quality of census data ²

² This problem has already been dealt with by the same author in Quality of statistical data, Rome, FAO.

A great deal of evidence has accumulated recently on the presence of various errors in statistical data. The wording of questionnaires may be ambiguous or misleading; enumerators often have their own opinion on what the answers should be, so that they influence respondents accordingly; in some cases, the respondents do not know the answers, while in others they do not wish to disclose the truth for one reason or another; the atmosphere in which some surveys are conducted is such that it induces error in the answers; the field work may be inadequately organized; the enumerators may lack the specific training or be wrongly selected; mapping material may be incomplete, with the result that double enumeration or omissions are likely to occur; editing, coding and punching will also contain errors, as they are inevitable in any type of mass production.

Accordingly, there are many sources from which errors may arise. This is why it has become a matter of modern standards in statistical practice to conduct quality checks of data or to check on the presence and the magnitude of errors in statistical surveys. The first reason for doing so comes from the obligation statisticians have toward users of the data. Without throwing some light on the quality of data collected there is no way of determining the degree of reliance that can be placed on them. Checking quality of data thus becomes a measure intended to promote the proper use of statistical information.

The next reason, and one of great importance for checking the quality of work in the various phases of statistical operations, is the fact that such checking represents the only serious way of obtaining information on the deficiencies in the methods used, on the kind of errors they lead to and the magnitude of these errors. Such information is the only concrete basis on which to build plans for an overall improvement of working methods and the removal or reduction of errors.

Checking the quality of data requires particular emphasis in connection with census work. Censuses are the major source of statistical information and this is a strong reason for special measures to increase the quality of data. On the other hand, censuses employ a large number of people who are, of necessity, less qualified on the average than the people employed in other types of surveys, with the result that census data are probably more subject to errors than the results of a relatively small sample survey.

Sampling methods are, indeed, of primary importance in the whole range of activities connected with the quality problems of census data. Checking quality often involves duplicating work done during the regular enumeration. On the grounds of cost, therefore, this work may be only feasible on a sample basis. By definition, sampling methods reduce the volume of operations and thus make quality checks less costly. Moreover, sample quality checks are conducted by highly trained personnel who can usually be employed only for small-scale work. Of course, there are other reasons for the use of sampling methods in this field. Among them is the fact that this is the only way of drawing an objective conclusion as to the magnitude of the bias in data, the difference in the performance between alternative methods, etc.

The application of sampling methods in checking the quality of census data embraces first checking on the quality of the listing, which is intended to give an idea of the effect of omissions and duplications on census results. It also gives information on circumstances in which enumerators fail to follow the instructions regarding the listing operations. Next comes checking the response errors with a view to estimating their magnitude and obtaining information for improvement of data collection techniques. The third field is the quality control of editing, coding and transfer to punch cards. Here too, sampling methods can be applied to estimate the magnitude of errors committed at this stage of statistical work and to help in designing more efficient data processing plans in future surveys.
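The first of these checks, on the quality of the listing, can be sketched numerically. The example assumes that a sample of enumeration districts is re-listed by trained personnel, that the re-listing is treated as correct, and that each district yields three counts; the function name, the rate definitions and the figures are hypothetical choices made for illustration.

```python
def coverage_error_rates(checks):
    """Estimate listing errors from re-listed sample districts.

    checks -- list of (census count, units omitted by the census,
              units counted twice by the census) per district
    Returns omission, duplication and net undercount rates."""
    census = sum(c for c, _, _ in checks)
    omitted = sum(o for _, o, _ in checks)
    duplicated = sum(d for _, _, d in checks)
    true_count = census + omitted - duplicated   # count implied by the check
    return {
        "omission_rate": omitted / true_count,
        "duplication_rate": duplicated / census,
        "net_undercount_rate": (true_count - census) / true_count,
    }

# Hypothetical results from three re-listed districts.
rates = coverage_error_rates([(100, 4, 1), (200, 6, 4), (150, 5, 0)])
print({name: round(value, 4) for name, value in rates.items()})
```

Omissions and duplications partly cancel in the net figure, so reporting them separately, as here, shows more about how the listing instructions were actually followed.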
2. SAMPLE CENSUS OR A COMPLETE ENUMERATION CENSUS
Advantages of a complete count

In considering the question as to whether a sample census or a complete enumeration census should be taken, the advantages which can be secured through the application of the latter method cannot be disregarded, as they may have decisive value. Some of these advantages are as follows.

1. Data from a complete enumeration census can be tabulated by administrative and other area units, whatever their size. In several recent censuses of population, totals for basic characteristics on the program have been prepared for units as small as blocks of houses or enumeration districts. Since sample surveys do not lend themselves to efficient tabulation for such a program, they do not come into consideration whenever a breakdown of information is required by units of the order of magnitude of villages, groups of villages, communes, and even smaller units.

2. Sample surveys are inefficient methods of obtaining information on rare events, such as areas under some crops and yields thereof, the number of persons of advanced age, their distribution by sex, and area of residence; the number of persons having a specified physical disability and the various types of their distribution. If any characteristic of such a nature is included in the census program, the chances of obtaining by means of sample censuses information that is to be of any use will normally be very slight.

3. Complete enumeration censuses are very often used as a basis for improvement of current statistics. For example, current agricultural statistics in many countries are based on reports from agricultural extension work staff or other persons interested in or acquainted with the state of agriculture in their respective localities. In order
to reduce expenditure on current statistics, these reports are results of eye estimates or general impressions on the characteristics in question. Data collected in this way usually contain biases of an unknown magnitude. To reduce biases, census data are sometimes used as a guide for reporters. For example, areas and yields in the current year can be estimated by evaluating the change with respect to the census year. If census information is accurate, this procedure might yield better estimates than would be the case with simple eye estimates. In these cases, sample censuses have little, if anything, to offer. Guidance to reporters is useful if available by relatively small administrative units. Estimates for such units obtained from sample censuses are subject to large sampling errors and thus can hardly be used for the above purpose.

4. Data of a complete enumeration census can be widely exploited as a basis for various surveys. If in a given country the number of surveys taken each year is even a modest one, the gains achieved by using complete census data in planning surveys might more than outweigh the immediate saving resulting from a sample census.
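Points 1 to 3 in the list above rest on the same arithmetic: the fewer sample units that carry a characteristic, the larger the relative sampling error of the estimate. The sketch below is a minimal illustration, assuming simple random sampling and ignoring the finite population correction; the sample sizes and proportions are hypothetical and serve only to show the order of magnitude.

```python
import math


def relative_standard_error(p, n):
    """Relative standard error of an estimated proportion p
    under simple random sampling of n units (no finite population correction)."""
    if not 0 < p < 1 or n <= 0:
        raise ValueError("need 0 < p < 1 and n > 0")
    return math.sqrt((1 - p) / (n * p))


# Hypothetical situations: (label, true proportion, sample units available).
cases = [
    ("common item, national sample of 2,000 holdings", 0.50, 2000),
    ("rare item, national sample of 2,000 holdings", 0.005, 2000),
    ("common item, 20 sample holdings in one village", 0.50, 20),
]

for label, p, n in cases:
    print(f"{label}: {relative_standard_error(p, n):.1%}")
```

Under these assumptions a sample that estimates a common item at the national level quite precisely gives a very unstable figure for a rare item, or for a single village, which is why the text treats such uses as the province of a complete count.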
In countries where the quota method is being used extensively for public opinion surveys, preferences and attitude studies, etc., the application of this method would hardly be possible without data from a complete enumeration census of population tabulated by small administrative units. Such tabulation makes it possible to allocate the total size of the sample to various administrative units and select therefrom a corresponding number of males and females, distributed over age groups, occupational classes, etc. The more data available as guidance in allocating the sample, the more accurate are the results likely to be.¹

¹ C.A. Moser and A. Stuart. An experimental study of quota sampling, Journal of the Royal Statistical Society, Vol. 116, 1953, Pt. 4, p. 349-394; C.A. Moser. Quota sampling, Journal of the Royal Statistical Society, Vol. 115, 1952, p. 411-423.

The same is true of sample surveys. As a first step, census listing can be used as a frame for the selection of the sample. This may prove an important saving, in view of the fact that the preparation of the frame generally consumes a fair part of the budget allotted. Census data can then be used as supplementary information in the process of estimation, which again might represent a considerable increase in the efficiency of
surveys. Similarly, census data have often been used for the selection of the sample with varying probabilities, and this in some cases leads to a substantial reduction of sampling errors. By taking sample censuses instead of complete enumeration censuses, these advantages are lost.

Prerequisites of a complete enumeration census

In spite of the advantages of a complete enumeration census, circumstances may be such that it is beyond a country's means. Every complete enumeration census, whatever its program and purpose, involves a certain minimum of facilities: machine tabulation equipment, mapping material in the form of sketches, aerial photographs or cadastral plans to identify the area units and avoid duplications or omissions in the enumeration, and funds for paying enumerators, supervisors, data processing, etc. In addition, it requires sufficiently qualified professional staff for the preparation of the right plans for all the details in census operations, qualified enumerators for collecting data in the field, a relatively large number of persons who are able to perform the processing work at a sufficiently high-quality level, etc.

It will be useful to divide the prerequisites for a census into two classes, the first involving what we shall call social background, and the second being the direct prerequisites.

Social background covers all elements of the general social and technical development of a country which are independent of any particular census, although they affect the possibility of taking censuses. Mapping material can be taken as an illustration. The enumeration procedure is greatly facilitated if census areas or enumeration districts are shown to each enumerator on a map so that boundaries of areas assigned are clearly recognized. Such maps, photographs, plans or whatever else may be used for this purpose, usually exist independently of the census. A similar example is afforded by the availability or otherwise of transportation facilities.
Transport is extremely important in census taking because the timetable of operations and the total census budget are usually in large measure dependent on what transportation facilities are available at the time. Here again, the taking of a census is influenced in a specific way by the resources available in a country independent of the census.
Another group of census prerequisites which are part of the social background may be found in the level of education of the population concerned. A census of either population or agriculture in a country with a low standard of literacy and general development is usually an extremely difficult undertaking. It is thus beyond question that the social background as a whole, or certain aspects of it, may adversely affect the very possibility of taking censuses.

There are a number of descriptions based on experience in some countries which show how difficult it can be to take a complete enumeration census if certain elements in the social background are unfavorable to it. The following characteristics, for example, are typical of conditions in Africa.

1. Populations not settled in villages but scattered among the hills.
2. The absence of transport facilities for census taking.
3. Great mobility of populations.
4. The scarcity of qualified enumerators.
5. Superstition, in various forms, making for reluctance to disclose census information.
6. The fear of census operations.
7. Illiteracy and ignorance.
8. Lack of interest.
9. Help not forthcoming from the administrative machinery,² or, if at all, in insufficient measure.

² Cf. C.J. Martin. The collection of basic demographic data in underdeveloped territories, Bulletin of the International Statistical Institute, Vol. 33, Pt. 3, 1953, p. 1727; C.J. Martin. The East African Population Census, 1948, Planning and Enumeration, Population Studies, Vol. 3, 1949, p. 303-320. See also: S.W. Dajani. The enumeration of the Beerscheba Beduins in May 1946, Population Studies, Vol. 1, 1949, p. 301-308; J.E. Goldthorpe. Attitudes to the census and vital registration in East Africa, Population Studies, Vol. 6, 1952, p. 163-171; J.R.H. Shaul and C.H.L. Myburgh. A sample survey of the African population of Southern Rhodesia, Population Studies, Vol. 2, 1948, p. 339-353; C.H. Harvie. A sampling census in the Sudan, Population Studies, Vol. 4, 1950, p. 241-249; Etude démographique par sondage, Guinée, 1954-55, Paris, Ministère de la France d'outre-mer, 1956; Enquête nutrition - niveau de vie, Subdivision de Bongouanou, 1955-56. Territoire de la Côte-d'Ivoire. The reader interested in the subject might also wish to consult a large number of various statistical reports issued in mimeographed form by the Ministère de la France d'outre-mer, Paris.

The majority of these difficulties would in the same way affect censuses of agriculture which are up against their own specific obstacles, such as ignorance on the part of the agricultural population of the meaning of units and concepts used in censuses, the existence of the large number of units for areas, weights and volume (which also may vary in their
meaning from one village or area to another), and the use of units devoid of any objective meaning.

It is evident that in some countries conditions for enumeration might be extremely difficult. At present, however, sampling methods contribute toward overcoming these difficulties. Sample censuses imply a reduction of operations and, consequently, simplify them.

A consideration of what we have called the "direct prerequisites" may suggest a similar expedient. Even when the social background is found to be favorable for a census program, there remain a number of facts to be examined before a complete enumeration census can be seen as definitely feasible. First among these is the problem of cost. Are sufficient funds available for the training and salary of field enumerators, field supervisors and for data processing? Second, does equipment exist for machine tabulation? In the case of certain countries, exchange control difficulties may preclude the purchase of the machinery required. Third is the time factor. Can adequate preparations of the census be completed by the date fixed for the beginning of the enumeration? The question of time may be particularly acute in connection with mapping material, selection of staff for different positions in the census organization and finding accommodations for equipment and field offices. Finally, the problem may have to be faced as regards printing schedules, particularly if several languages are involved.

Difficulties in taking complete enumeration censuses on the grounds of an unfavorable social background are experienced mostly in developing countries. Direct prerequisites, however, are sometimes unsatisfactory even in countries where censuses have been taken previously and where the social background as such presents no obstacle to taking a new one. A case in point is Jamaica, where a complete enumeration agricultural census was taken in 1942.
In 1950, however, in connection with the World Census of Agriculture, the same type of census was not repeated, although there was no difficulty from the point of view of social background. At that time, stress was laid on a number of direct prerequisites, such as funds needed for a complete enumeration census, amount of organizational work, etc. The outcome of these considerations was the decision to take a sample census.³

³ Cf. W.D. Burrowes. Sample survey of production of selected agricultural products, 1950, Jamaica, Department of Agriculture, Bulletin No. 48, New Series, 1952.

Another illustration is the 1954 sample census of livestock in Yugoslavia. Because of great changes in that country's economy during the
postwar period, holdingwise complete enumeration censuses of livestock were taken each year from 1949 through 1953. In 1954, the funds were reduced, so that a sample census was taken at a total cost of approximately one tenth of the average for the previous complete enumeration censuses.

In addition to these cases, we may consider that of Poland, where in 1949 a sample census of population⁴ was taken because there was no time available for the preparation of a complete census. Here, too, the relevant prerequisites belonging to the social background were not lacking. The last census had been taken on 14 February 1946. This was a complete enumeration census but, after a very short time, the findings were no longer considered significant. Firstly, the composition of the Polish population had changed considerably. Over 2 million Germans and more than 200,000 representatives of other nationalities had left Poland, while many Poles had been repatriated. Up-to-date information was therefore needed on the effect of these changes. Secondly, the 1946 census was tabulated in such a way that only three age groups of the population were used: those under 18 years, from 18 to 59, and 60 and over. Such a classification could not satisfy planning requirements and it was felt that a classification of the population into five-year age groups was desirable. While there were many reasons for taking a complete enumeration census, it was decided to take a sample census because the information sought was urgent.

These examples serve to show that sample censuses, as a method of obtaining the information desired by reducing the volume of operations, may sometimes prove the only possible way of securing data on a census program even in those countries where complete censuses have been taken previously.
Unforeseen difficulties, insufficient time for preparation, and other disturbances of a similar nature leave one no choice but to follow the less burdensome approach made possible by sample censuses.

⁴ Stefan Szulc. The sample census of population in Poland, 1949, Population Studies, Vol. 4, 1950, p. 112-114.

Needs

Whether a sample census or a complete enumeration census should be taken in a concrete case may also be determined in the light of a country's need for data. In some cases the need for data on small
units, or, contrariwise, the need for a complete count is obvious. This is so, for instance, in population censuses when data are used for determining the representation of various units in parliament or in demarcating constituencies by grouping small units according to the number of people living there. These and similar uses have made a complete enumeration census of population an indispensable tool in organizing social and economic life according to modern patterns.

In agricultural censuses, the situation is very different because there is not such an obvious need for data by units as small as villages or even smaller. Nevertheless, all of the more highly developed countries took a complete census around 1950 and prepared totals of census characteristics by small units which varied in size from one country to another. The same countries also prepared a number of classifications by small units. Table 1 shows the smallest units by which the size classification of holdings was prepared in the various countries in the 1950 World Census of Agriculture. It will be seen from this table that the average size is very small in some cases. In the Netherlands, it was approximately 410 holdings, in Canada less than 200, in Norway and Finland less than 1,000 holdings, and so on.

TABLE 1. - SMALLEST ADMINISTRATIVE UNITS BY WHICH DATA ON SIZE CLASSIFICATION OF AGRICULTURAL HOLDINGS WERE PUBLISHED OR PREPARED IN SOME COUNTRIES IN THE 1950 WORLD CENSUS OF AGRICULTURE

Country                  Total number of            Smallest unit             Approximate number of
                         agricultural holdings ¹                              these units in the country

Belgium                        990 913              Canton                            230
Finland                        465 655              Commune                           557
Germany, Fed. Rep. of        2 011 992              Kreis                             554
Ireland                        379 487              County                             26
Luxembourg                      28 389              Canton                             12
Netherlands                    409 798              Municipality                    1 000
Norway                         349 528              ....                              680
Portugal                       835 568              ....                              278
England and Wales              427 200              County                             61
Canada                         623 091              (special) Subdivisions          3 250
United States                5 382 162              County                          3 100

¹ Data on the number of holdings in these countries were taken from Report on the 1950 World Census of Agriculture, Vol. I, Census results by countries, Rome, FAO, 1955.
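The averages quoted in the text (about 410 holdings per unit in the Netherlands, fewer than 200 in Canada) follow from dividing the number of holdings by the number of units. The short sketch below repeats that division for every country, using the figures as they can be read from Table 1; the variable and function names are introduced here for illustration only.

```python
# Figures as read from Table 1:
# country -> (total number of holdings, approximate number of smallest units).
TABLE_1 = {
    "Belgium": (990_913, 230),
    "Finland": (465_655, 557),
    "Germany, Fed. Rep. of": (2_011_992, 554),
    "Ireland": (379_487, 26),
    "Luxembourg": (28_389, 12),
    "Netherlands": (409_798, 1_000),
    "Norway": (349_528, 680),
    "Portugal": (835_568, 278),
    "England and Wales": (427_200, 61),
    "Canada": (623_091, 3_250),
    "United States": (5_382_162, 3_100),
}


def average_holdings_per_unit(holdings, units):
    """Average number of holdings in one smallest tabulation unit."""
    return holdings / units


for country, (holdings, units) in TABLE_1.items():
    average = average_holdings_per_unit(holdings, units)
    print(f"{country:<24}{average:>10.0f}")
```

The spread is wide: the same calculation gives a few hundred holdings per unit in some countries and several thousand in others, which is why the text speaks of units that "varied in size from one country to another".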
Obviously, the fact that practically all the developed countries present the results of their agricultural censuses by small administrative units
does not mean that this presentation is required in every case. The countries which appear in Table 1 can boast a long tradition in agricultural censuses and in statistics in general, some of them having started regular census taking more than a century ago. In the early days, the only means of arriving at a comprehensive picture and establishing national totals was to take a complete enumeration census. In the course of time the difficulties encountered in the application of this method became attenuated; the accumulation, over the years, of experience in this field led the way to improving methods and to clarifying concepts and definitions; co-operation with farmers was likewise developed to such a degree that factors like suspicion, reluctance to disclose information and the inability, through ignorance, to answer questions were practically eliminated. Moreover, in many cases farmers have become aware of the usefulness of censuses and have gradually developed a willingness to co-operate in census operations.

With such a background, the complete enumeration census in most countries listed in Table 1 has become largely a matter of tradition and can be repeated relatively easily at any time. The methodological summary in the FAO publication Report on the 1950 World Census of Agriculture shows that in these countries local authorities were able to make themselves responsible for a large part of the census work, including selection and appointment of enumerators and supervising field operations. Difficulties in census taking are thus reduced considerably. In some countries, like England and Wales, the situation is even more propitious, because the collaboration of farmers is developed to such an extent that a postal census is possible every year.
The complete enumeration census may thus be acknowledged as something deeply rooted in the tradition of the developed countries, having been practiced there over a long period and gradually developed into a well-established routine; its continuation and renewal from time to time does not create problems of any special difficulty. This being the case, a sample census may only create unnecessary difficulties, whereas adding one more complete enumeration census to the already well-established list of those taken in the past would not.

One consideration remains, which belongs to basic characteristics of the census methodology in the countries we have mentioned, namely, the fact that the existing administrative machinery was allotted a number of different tasks from the start. Normally, district (or county) authorities are charged with the preparation and supervision of census operations, while authorities at the commune level are usually responsible for appointing enumerators and conducting work in the field. The participation of the authorities at both levels is necessary in order to expedite the work of organizing, reducing costs and adapting operations to the particular circumstances. However, as soon as recourse is had to the administrative machinery, the work is inevitably divided up by the existing administrative units, so that the authority concerned can assume responsibility for the whole work in its territory. And this division does not affect only the collection of data; it penetrates all the subsequent stages of the work and is reflected in the first checks made on the completeness and quality of data, in the concentration of census schedules, in the preparation of first results while the material is still in the field, in the files of schedules waiting to be processed or already in the course of processing, because questionnaires are usually edited, punched and verified by using enumeration districts as units. Finally, national totals are arrived at by first producing the subtotals for the smallest administrative units.

In this way, the methodology of the complete enumeration censuses in most of the countries mentioned in Table 1 is such that it directs interest toward small administrative units as the prerequisite for efficient work, which may further explain the small-sized units employed in the presentation of data. Thus this interest may have been, at least in the earlier stages of census history, more the result of the characteristics of the methods used than the consequence of the needs for such presentation. The presentation of data, or at least basic census totals by small administrative units, started relatively early in countries with a long tradition in agricultural censuses. On the other hand, the needs which call for such a presentation are of relatively recent origin.
Accordingly, it might well be that a fine breakdown of census data preceded both the needs for information by small units and the knowledge of how to put such information to use.

Most of the less developed countries were outside this historical development, and it is therefore necessary to view their problem in a different light. The question may usefully be put in the following way: Are sample census data sufficient to satisfy the needs of development programing and general planning?

There are indeed several types of planning, and the part played by statistical data in each of them is different. For example, the planning of agricultural production may have the form of establishing national production targets which are later broken down, according to given criteria, into provincial, district and commune targets. In the ultimate administrative units, these targets are transformed into individual production plans which have to be implemented by agricultural holdings, as in the U.S.S.R. In the latter case, the individual plan targets are imposed on the holdings concerned, which thereafter keep records on a large number of their activities in order to check on the progress made in the fulfillment of those obligations.

The requirements of this type of planning have very little in common with our problem and are such that more is required than either of these two methods in their usual form can supply. The breaking down of plan targets by administrative units is only possible on the basis of data presented by the smallest administrative units. Determining individual targets, however, requires data on each individual holding. This is an interest which lies outside the usual scope of statistics in general and censuses in particular. The collection of data for studies of individual cases and their characteristics is not what statistics is called upon to undertake. Moreover, in such planning, continuous record keeping on a large number of items is introduced to provide the ground for checking the progress of work and enable its proper management within the holdings themselves. As kept in the U.S.S.R., these records contain more data than is usually collected in agricultural censuses. Monthly reports, based on these records, supplemented from time to time with items of special information, satisfy the needs of such planning and, in principle, make both sample censuses and complete enumeration censuses unnecessary.

In other types of planning, there can be no question of obligations imposed on individual holdings.
The targets of larger and relatively homogeneous areas are set up in accordance with the resources available and then facilities and incentives are established for individual holdings to carry out more work along the lines planned. Here the plans are considered to be fulfilled if the area as a whole reaches the target set. The objectives of such planning are constituted by important social issues and problems of large areas. Consequently, the resources made available for improvements are usually provided for in the national budget. The projects dealt with in this case are those which intend to delve deeper into the economic structure or change certain basic aspects of social and economic life, such as attempts to introduce new agricultural practices or crops, measures to improve agricultural technology, projects for more intensive use of fertilizers, irrigation and drainage works, and the construction of highways and "farm-to-market" roads. All this requires co-ordinated work in a number of different fields owing to the complex interrelationships of modern economy. For example, if action is contemplated for a more intensive use of fertilizers, it necessitates experimentation with a view to determining the gains to be expected from increasing costs of production, then securing funds for the import of fertilizers or the construction of a manufacturing plant, a solution to the transport problem, adequate extension work where the uses of these fertilizers will be explained, the provision of credit to holders who are not able to buy for cash, and so on.

It will easily be seen that a wide range of statistical data is necessary for studies connected with planning of this nature. However, this does not necessarily mean that data will be required by small administrative units. Suitable homogeneous areas often coincide with large administrative units or groups of moderate-sized units like districts or counties. If so, a sample census of agriculture can be designed to suit such planning. Interest is not directed here toward individual holdings or any particular small area but toward averages or totals for relatively large areas which can be estimated by using sample censuses.

The preparation and execution of development programs at the level of the small administrative unit, such as the commune or village, is not much dependent on statistics. At this level, and for even larger units, phenomena are met with in concrete form, holdings are known individually, and there may be but little interest in an abstract statistical picture. Here planning is probably concerned with repairing a road, building a school, improving veterinary services, organizing milk transport and the like.
The implementation of such projects is based on limited local resources and does not necessarily need statistics either for the formulation of programs or for checking the results obtained. Taking sample censuses instead of complete enumeration censuses would not, therefore, give rise to difficulties in small administrative units where the authorities are attempting to improve existing conditions.

This consideration of agricultural planning thus leads to the conclusion that the actual use to which the governments of some countries put statistical data may not call for a complete enumeration census. The same may also be true with nongovernmental uses.⁵

⁵ The reader may wish to consult C. Taeuber. Using a census of agriculture, Bulletin of the International Statistical Institute, Vol. 36, 1958, Pt. 4, p. 251-259. The paper represents a review of a large number of various uses of census data and it might help in deciding whether a complete enumeration is called for in a concrete case.

However, such a view
of the problem may fail to take into account future needs and developments. At the present time, various changes in economic and social life are taking place so rapidly that it is practically impossible to predict what course events will follow even in the near future. If one may be allowed to make a prediction on present-day trends, development should be expected toward more planning and programing work, and public interest in shaping the life of the country, with statistics automatically becoming a necessity. The bearing of all this on the census problem is obvious. A broader view of the usefulness of censuses could thus secure the facts for future sound decisions and policy making. A census today may contribute substantially toward meeting the vital needs of tomorrow.⁶

Much more could be said in justification of censuses. Each country is collecting all sorts of documents relating to various aspects of its history and no one is likely to suggest that these activities should be abandoned. Collecting documents of various kinds is a well-established practice and is not questioned in any quarter. Censuses, however, do not enjoy such universal approval, yet they, too, are a historical document of primary importance. A partial explanation of this may be found in their relatively recent origin. Censuses of population will show how many people were living in a country at a given point of time, their distribution over the area, how they were distributed in various areas by age, sex, education, school attendance, occupations, and income. In this way, a document containing more facts than a census would be hard to find. Nor should we forget that others are to come after us who will certainly wish to look back and trace various developments in the country as a whole and in its individual regions and units. Complete enumeration censuses are thus a goal worth pursuing, even if the information they supply is considerably beyond the abilities of the present generation to utilize to the fullest extent.
Quality of data
The experience gained in recent censuses and a large number of other statistical surveys has brought a new problem to the fore of both theory and practice. This is the problem of accuracy or quality of statistical data.

⁶ See the paper by P.V. Sukhatme. Statistics for agricultural planning, presented at the 12th Annual Meeting of the Indian Society of Agricultural Statistics, held at Gwalior, India, 1959.
Studies conducted so far have shown that it has an essential bearing on the many decisions to be taken in the selection of suitable methodology. In some countries quality considerations may lead to a decision to take a sample census instead of a complete enumeration.

Here it is important to bear in mind from the very beginning that the problem of quality is a very general one, in the sense that errors exist everywhere, in all types of statistics and in all countries, irrespective of how developed they are. For example, the study of the biases present in the 1945 United States Census of Agriculture has shown that 14 percent of farms of the total obtained in the census were omitted in the regular enumeration, while 3 percent were erroneously included in the census results.⁷ A similar study conducted under the 1950 census gave the following percentages of the net underenumeration:⁸

Item                               Percentage of net underenumeration

Number of farms                    5.1
Land in farms (acres)              2.0
Crop land harvested (acres)        2.1
Maize harvested (acres)            1.3
Wheat harvested (acres)            1.6
Cotton harvested (acres)           7.9
Hen eggs sold (dozen)              2.4
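The two 1945 percentages quoted above can be combined into a single net figure, as the arithmetic sketch below shows. It assumes that both percentages are expressed on the same base, the total obtained in the census, which the text suggests but does not state outright; the census count of 1,000 farms is purely hypothetical, and the function names are introduced here for illustration.

```python
def net_underenumeration(omitted_pct, erroneously_included_pct):
    """Net underenumeration, in percent of the census total:
    omissions minus erroneous inclusions."""
    return omitted_pct - erroneously_included_pct


def adjusted_total(census_total, net_pct):
    """Census total corrected for net underenumeration, assuming the net
    rate is expressed in percent of the census total."""
    return census_total * (1 + net_pct / 100)


# 1945 United States Census of Agriculture, figures quoted in the text.
net_1945 = net_underenumeration(omitted_pct=14, erroneously_included_pct=3)
print(f"net underenumeration of farms, 1945: {net_1945} percent")

# Hypothetical census count, used only to show the size of the correction.
print(f"corrected count for 1,000 enumerated farms: {adjusted_total(1000, net_1945):.0f}")
```

If the published rates were instead expressed in percent of the true total, the correction would be a division by one minus the rate rather than a multiplication, so the base should be checked in the original study before any adjustment is made.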
These data show that, even in an advanced country, data are subject to errors that cannot be neglected. In developing countries where a statistical service has only recently come into being, where qualified enumerators are very scarce and where, in addition, the population is to a large extent unaware of correct answers to census questions or is reluctant to disclose the information, the quality problem becomes even more acute.

⁷ A. Ross Eckler and Leon Pritzker. Measuring the accuracy of enumerative surveys, Bulletin of the International Statistical Institute, Vol. 33, 1951, Pt. 4, p. 7-24.
⁸ United States Bureau of the Census. U.S. Census of Agriculture, 1950, Vol. 2, General report: statistics by subjects, Washington, D.C., 1952.

The quality of survey information largely depends upon the education level of the population. This is illustrated in Table 2, which gives data on the percentage of illiterate persons in various republics of Yugoslavia and the percentage of incorrect answers on the question of date of birth
SAMPLE CENSUS OR A COMPLETE ENUMERATION CENSUS
25
as asked in the Census of Population taken in 1953.9 Data on inaccu rate answers were obtained in a special sample check conducted after the census was taken. An answer given to the regular census enumera tor on the question of date of birth was defined as inaccurate if not in agreement with what was shown in the registers, school certificates and other documents which could be used to provide more accurate information. The fact that emerges from this table is that republics with a high percentage of illiterate persons have also a high percentage of inaccurate answers. TABLE
2. -
PERCENTAGE OF ILLITERATE PERSONS AND INACCURATE ANSWERS ON THE "DATE OF BIRTH" QUESTION AS OBTAINED IN VARIOUS REPUBLICS 1
(1953 Census of Population, Yugoslavia)
Republic
Serbia Croatfa Slovenia Bosnia and Herzegovina Macedonia Montenegro 1
I
Illiterate persons
I
Inaccurate answers
. . . . . . . . . . . . . Percentage . . . . . . . . . . . . . 29 26 15 17 2 5 39 45 32 55 25 42
Data in this table are taken from S.S. Zarkovich,
op.
cit.
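The association described in the text can be summarized in a single figure. The following is a minimal sketch, not part of the original study: it computes an ordinary product-moment correlation coefficient across the six republics from the percentages printed in Table 2. The function name `pearson_r` and the choice of this particular coefficient are assumptions made here for illustration only.

```python
from math import sqrt

# Percentages from Table 2: (illiterate persons, inaccurate answers).
table2 = {
    "Serbia": (29, 26),
    "Croatia": (15, 17),
    "Slovenia": (2, 5),
    "Bosnia and Herzegovina": (39, 45),
    "Macedonia": (32, 55),
    "Montenegro": (25, 42),
}


def pearson_r(xs, ys):
    """Product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


illiterate = [pair[0] for pair in table2.values()]
inaccurate = [pair[1] for pair in table2.values()]
r = pearson_r(illiterate, inaccurate)
print(f"correlation across the six republics: {r:.2f}")
```

With only six observations the coefficient is no more than a rough indication of the strength of the relationship.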
9 S.S. Zarkovich. Population census errors, Belgrade, Federal Statistical Office, 1954 (in Serbo-Croatian).

The problem of quality is equally acute in censuses of agriculture. With holdings producing for their own consumption and the population unaware of units that could be meaningfully used in statistics, it may be a practical impossibility to get sufficiently accurate data within the measures and procedures normally followed in complete enumeration censuses. V.G. Panse has given an example which shows to what extent data can vary if they are obtained by enumerators with differing degrees of acquaintance with local conditions. 10 Corresponding data are presented in Table 3. They exhibit the average size of operational holdings as obtained in various parts of India by groups of enumerators, each with a different knowledge of local conditions. It can be seen that the averages obtained are lowest in the first survey and highest in the third. Data in the first case were obtained by the enumerators in a simple visit to the selected holdings. In the two other surveys the field workers remained in the field for a longer period. They further differed in that the data collected by Survey II were obtained as background information for a different main interest, while in Survey III data were collected by enumerators who were centered round the holdings selected and had a better insight into their characteristics.

10 V.G. Panse. Some comments on the objective and method of the 1960 World Census of Agriculture, Bulletin of the International Statistical Institute, Vol. 36, 1958, Pt. 4, p. 222-227.

TABLE 3. - AVERAGE SIZE OF OPERATIONAL AGRICULTURAL HOLDINGS, IN ACRES, AS DETERMINED IN INDIA IN SOME RECENT SURVEYS 1

Region         Survey I   Survey II   Survey III
North           3.9        5.3         6.7
North-West     11.9       12.6        20.6
East            3.5        4.5         4.8
Central        10.9       12.2        14.6
West           11.1       12.3        14.9
South           3.8        4.5         6.2
All-India       6.1        7.5         8.9

1 Data in this table are taken from V. G. Panse, op. cit.
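The ordering of the three surveys noted in the text can be checked region by region. The sketch below is illustrative only: it uses the figures printed in Table 3, and the variable names (`surveys`, `excess`, `increasing`) are introduced here and do not come from the source.

```python
# Average size of holdings in acres, as printed in Table 3.
regions = ["North", "North-West", "East", "Central", "West", "South", "All-India"]
surveys = {
    "I": [3.9, 11.9, 3.5, 10.9, 11.1, 3.8, 6.1],
    "II": [5.3, 12.6, 4.5, 12.2, 12.3, 4.5, 7.5],
    "III": [6.7, 20.6, 4.8, 14.6, 14.9, 6.2, 8.9],
}

# True if, in every region, Survey I <= Survey II <= Survey III.
increasing = all(
    a <= b <= c
    for a, b, c in zip(surveys["I"], surveys["II"], surveys["III"])
)

# Percentage by which the Survey III average exceeds the Survey I average.
excess = {
    region: 100 * (s3 - s1) / s1
    for region, s1, s3 in zip(regions, surveys["I"], surveys["III"])
}

print("same ordering in every region:", increasing)
for region in regions:
    print(f"{region:<12} Survey III exceeds Survey I by {excess[region]:.0f} percent")
```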
The situation may be even worse where more refined methods of collecting data are called for. Such cases arise if measurements of areas have to be taken, in cross-questioning, in the use of various control questions, etc. For this type of work the regular census enumerators can hardly be trained satisfactorily, particularly in countries where civil servants are not used for collecting data. Table 4 presents the picture of the sort of persons who acted as enumerators in the 1951 Census of Canada. An attempt to train these people to the extent of enabling them all to perform complicated work, such as measurement and cross-questioning, would make a census impossible because enumerators in most of these classes lack the requisite general education.

TABLE 4. - DISTRIBUTION OF THE TOTAL NUMBER OF ENUMERATORS BY OCCUPATION 1
(1951 Census of Canada)

Occupation                        Number
Homemakers                         5 368
Farmers and stock raisers          3 087
Students                           1 284
Retired persons                    1 260
Office workers                     1 169
Owners and managers (retail)         593
Salesmen, sales agents               442
Schoolteachers                       394
Sales persons                        244
Laborers (not on farm)               147
Carpenters                           105
All other                          1 476
Not stated                         1 281
Total                             16 850

1 This table is taken from Canada. Dominion Bureau of Statistics. Census of Canada, 1951, Administrative Report, Vol. 2, Ottawa, 1955, p. 49.

In some cases, inaccurate data may again be adequately guarded against by replacing complete enumeration censuses by sample censuses. In this connection, FAO programs for the 1960 and 1970 World Census of Agriculture put emphasis on the need for taking the quality aspect into account and considering the use of sample censuses if a complete census cannot be taken on a satisfactory quality level. In this respect, it may be necessary to go even further and split up the sample census into several surveys which would make up a census if taken as a whole. Mahalanobis 11 has proposed a division of the census program into two parts: (a) utilization of land; and (b) number of persons living on the land (classified by characteristics, such as age, sex, and employment status) and information on the type of holding (operational, ownership, land tenure, rent, etc.). For items listed under (a), the holding would not be an appropriate unit in some countries, and the information needed might be collected with a higher degree of accuracy if sample surveys were used with plots or fields as units and physical observations on selected units. On the other hand, data on items under (b) can be more easily collected through sample surveys of holdings. The whole census is thus split into two operations, which may be taken at different times, in order to increase the quality, or combined in one way or another if economy has to be effected. 12

11 P.C. Mahalanobis. Some observations on the World Census of Agriculture, 1960, Bulletin of the International Statistical Institute, Vol. 36, 1958, Pt. 4, p. 214-221.
12 In connection with this problem the reader may also wish to see the reports by V.G. Panse (op. cit.) and P.V. Sukhatme. The 1960 World Census of Agriculture, Bulletin of the International Statistical Institute, Vol. 36, Pt. 4, 1958, p. 239-250, where similar ideas were expressed as to the role of sampling methods in agricultural statistics in developing countries.
3. SOME COMMENTS ON SAMPLE CENSUSES
Advantages of sample censuses

Sample censuses are carried out by collecting the information for the program from a sample of units representing the population. Since the sample used can be made small as compared with the population, one may also reasonably expect to see a corresponding difference in the amount of work and resources required. But this can be misleading. In point of fact, sample censuses require the same type of facilities as complete enumeration censuses. The difference is only in the extent to which particular prerequisites have to be granted and in the volume of operations. In some cases, sample censuses, too, may require long and costly preparations.

The question therefore arises as to how much is to be gained by replacing a complete enumeration census by a sample census. Certain aspects of this problem are very well known: the amount of work, and the expenditure and difficulties connected with carrying out a sample census depend primarily upon the requirements imposed. Here one has first to consider the precision required or the magnitude of the sampling errors. High precision demands large samples and large samples lead to a volume of operations which may not be very different from a complete enumeration.

An illustration of the relationships between the size of the sample and the precision required is given in Table 5. The basic data in this table are taken from the 1951 Census of Livestock in Yugoslavia. The total number of agricultural holdings in the country was 2,392,597. The size of the sample required for a given precision is calculated for simplicity on the basis of the theory of simple random sampling. Such a design is not likely to be used in any country and therefore the number of holdings needed for the precision stated in a more realistic design would be somewhat higher. But the basic principle disclosed in this table is a good illustration of the fact that in sampling surveys the size of the sample increases rapidly with the increase of the precision desired. If the precision margin is narrowed from 10 to 1 percent, i.e., if estimates ten times more precise are desired, the size of the sample has to be increased a hundred times.
TABLE 5. - ILLUSTRATION OF THE RELATIONSHIPS BETWEEN THE SIZE OF THE SAMPLE AND THE PRECISION DESIRED 1
(Data refer to the estimation of the number of horses, cattle, sheep and pigs, 1951 Census of Livestock, Yugoslavia)

                                               Number of holdings to be enumerated for the
                                               precision desired (expressed in percentages)
Characteristics   Average per   Variance       10       5        3        2         1
                  holding
Horses            0.3618         0.5679        434    1 735    4 821   10 846    43 385
Cattle            1.6569         3.3907        124      494    1 372    3 088    12 351
Sheep             3.2268        51.3406        493    1 972    5 479   12 327    49 308
Pigs              1.0520         2.8503        258    1 027    2 851    6 416    25 663

1 Data for this table are taken from Livestock, 1951, Belgrade, Federal Statistical Office, 1953, Statistical Bulletin No. 15.
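The way the sample size grows with the precision desired can be sketched numerically. The block below is an illustration under stated assumptions rather than the computation actually used for Table 5: it assumes simple random sampling, no finite population correction, and precision expressed as the coefficient of variation of the estimated mean, so that the required number of holdings is the variance divided by the square of (precision times mean). The function name `required_n` is introduced here. Under these assumptions the sketch reproduces the printed entries for horses, cattle and sheep at the 10 and 1 percent levels; the pigs row is left out because this simple formula does not reproduce its printed values exactly.

```python
def required_n(mean, variance, rel_precision):
    """Holdings needed under simple random sampling (no finite population
    correction) for the estimated mean to have the given coefficient of
    variation, e.g. rel_precision=0.02 for 2 percent."""
    return variance / (rel_precision * mean) ** 2


# Average per holding and variance, as printed in Table 5.
characteristics = {
    "Horses": (0.3618, 0.5679),
    "Cattle": (1.6569, 3.3907),
    "Sheep": (3.2268, 51.3406),
}

for name, (mean, variance) in characteristics.items():
    sizes = [round(required_n(mean, variance, p)) for p in (0.10, 0.01)]
    print(f"{name:<7} 10 percent: {sizes[0]:>6}   1 percent: {sizes[1]:>6}")

# Narrowing the margin from 10 to 1 percent multiplies the size by 100.
ratio = required_n(0.3618, 0.5679, 0.01) / required_n(0.3618, 0.5679, 0.10)
print(f"ratio of sample sizes, 1 percent against 10 percent: {ratio:.0f}")
```

The hundredfold increase follows directly from the precision entering the formula as a square.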
If a small error of the order of, say, 1 percent has to be secured for all the characteristics shown in the table, a sample of 50,000 holdings, or 2 percent, would be needed. With a multistage design, such as would be used here in the actual operation to reduce the cost of travel and of the preparation of a frame, the enumeration would be confined to a certain number of primary units (villages or communes), but the total size of the sample of holdings needed for the same precision would be still larger. In terms of percentages, the resulting sample may become too large in the case of a country with the same variations of these characteristics but with a smaller number of holdings. It is clear, therefore, that high precision requirements can lead to a corresponding reduction in the advantages of sample censuses.

Compromise on precision requirements is thus an obvious expedient for making sample censuses adaptable to a large range of conditions. It has already been said that for a fixed program complete enumeration censuses require a certain minimum of facilities. If these facilities are not available, the census cannot be taken. Sample censuses are more flexible, provided there is a readiness to accept moderate degrees of precision.

The same considerations may be said to apply to the estimation of very low proportions. The size of the sample required for estimation of proportions with a specified coefficient of variation is shown in Table 6. The figures in this table are again based on the theory of simple random sampling. They show that even for a moderate coefficient of variation of 10 percent, the estimation of the proportion of 0.001 requires as large a sample as 99,900 units, which can be considered beyond any practical possibility. In other words, if estimates of small proportions are
needed, a sample census will not be suitable unless a very low precision is acceptable.

TABLE 6. - SAMPLE SIZE REQUIRED IN ESTIMATING PROPORTIONS WITH A SPECIFIED COEFFICIENT OF VARIATION

                Sample size required for the estimation of proportions with
                specified coefficients of variation (expressed in percentages)
Proportions     10         5           3             2             1
0.5                100        400       1 111         2 500        10 000
0.1                900      3 600      10 000        22 500        90 000
0.01             9 900     39 600     110 000       247 500       990 000
0.001           99 900    399 600   1 110 000     2 497 500     9 990 000
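The entries of Table 6 can be reproduced with a short calculation. The sketch below assumes simple random sampling with no finite population correction, in which case the coefficient of variation of an estimated proportion p from n units is the square root of (1 - p)/(n p), so that n = (1 - p)/(p cv²). The function name `n_for_proportion` is introduced here for illustration.

```python
def n_for_proportion(p, cv):
    """Units needed under simple random sampling (no finite population
    correction) to estimate a proportion p with coefficient of variation
    cv, e.g. cv=0.10 for 10 percent."""
    return (1 - p) / (p * cv ** 2)


proportions = (0.5, 0.1, 0.01, 0.001)
cvs = (0.10, 0.05, 0.03, 0.02, 0.01)

# Print the grid in the same arrangement as Table 6.
for p in proportions:
    row = "  ".join(f"{round(n_for_proportion(p, cv)):>9}" for cv in cvs)
    print(f"{p:<6} {row}")
```

The required size grows both as the proportion falls and as the coefficient of variation is tightened, which is why rare characteristics are so demanding.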
Another aspect of the sample size problem is the presentation of data. If separate estimates are required by provinces or other administrative divisions, with a somewhat high precision, this leads to a further increase of the sample with all the difficulties entailed.

The effect that a need to provide statistical data for small administrative units has on the gains to be expected from using a sample census instead of a complete enumeration census is obvious from the basic theory of sampling. Other things being equal, a given precision of sample estimates requires an almost constant number of units in the sample, irrespective of the size of the population concerned. If a country is divided into ten provinces with the same mean and variance in these provinces as in the population as a whole, the size of the sample required to achieve a specified precision in the estimates in each of these ten provinces will be the same as the size of the sample needed for an equal precision in estimates for the country as a whole. In other words, an additional requirement to provide equally precise estimates for ten provinces has increased the total size of the sample as much as ten times. It is obvious that a high number of administrative units for which separate estimates are desired might seriously limit the hope of facing fewer difficulties by the use of sample censuses.

An illustration of what can be expected if regional estimates are needed will be shown by referring to the previous example from Yugoslavia. The country is composed of six republics, each comprising a number of districts with a number of communes within each district. In censuses, the totals are presented by communes, while sample surveys have
to give estimates by republics and in some cases by groups of districts. In Table 7 it can be seen how the total size of the sample changes by introducing the requirement of fairly precise estimates for each republic. In this table, the same characteristics are set out as in Table 5. In columns 1 through 6, the size of the sample is presented for each republic and each characteristic separately. Column 7 contains corresponding sample sizes for the country as a whole. Each sample size assumes a precision of 2 percent. In the calculations, simple random sampling was likewise adopted.

TABLE 7. - SIZE OF SAMPLE NEEDED TO ESTIMATE THE NUMBER OF HORSES, CATTLE, SHEEP AND PIGS BY REPUBLICS WITH A COEFFICIENT OF VARIATION OF 2 PERCENT, AS COMPARED WITH THE SIZE OF THE SAMPLE FOR CORRESPONDING ESTIMATES FOR YUGOSLAVIA AS A WHOLE
(1951 Census of Livestock, Yugoslavia)

                                   Republics
Item      Serbia   Croatia   Slovenia   Bosnia and    Macedonia   Montenegro   Yugoslavia
                                        Herzegovina
          (1)      (2)       (3)        (4)           (5)         (6)          (7)
Horses    12 371    9 649    17 220      8 067         8 260       9 608       10 846
Cattle     3 592    2 679     2 974      1 935         3 171       3 112        3 088
Sheep      7 123   27 325    33 479     10 979        10 114       7 479       12 327
Pigs       5 252    6 400     3 535     11 950        17 150       6 450        6 439
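The consequence of asking for the same precision in every republic can be worked out from the figures of Table 7. The following sketch is illustrative only: columns are identified by their printed numbers 1 to 6 rather than by name, the four values in each list are in the order horses, cattle, sheep, pigs, and the variable names are introduced here.

```python
# Sample sizes from Table 7, columns 1 to 6 (republics), each list in the
# order horses, cattle, sheep, pigs.
republic_columns = {
    1: [12371, 3592, 7123, 5252],
    2: [9649, 2679, 27325, 6400],
    3: [17220, 2974, 33479, 3535],
    4: [8067, 1935, 10979, 11950],
    5: [8260, 3171, 10114, 17150],
    6: [9608, 3112, 7479, 6450],
}
# Column 7: corresponding sizes for the country as a whole.
country_column = [10846, 3088, 12327, 6439]

# To secure the precision for every characteristic, each republic needs
# the largest of its four sample sizes.
size_per_republic = {col: max(sizes) for col, sizes in republic_columns.items()}
total_by_republic = sum(size_per_republic.values())
national = max(country_column)

print("holdings needed, estimates by republic:", total_by_republic)
print("holdings needed, country as a whole:   ", national)
print(f"ratio: {total_by_republic / national:.1f}")
```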
If 2 percent minimum precision has to be secured for all the characteristics in each republic, it means that the maximum size of the sample presented within each republic has to be used. In other words, the total of 111,883 holdings would be used in the country as a whole to satisfy the precision requirement adopted. If this result is compared with the data in column 7, it is easy to understand the kind of consequences a demand for precise regional estimates may lead to. It is obvious that the situation would be even worse in the case of a requirement to provide precise estimates by still smaller administrative units, such as districts or groups of communes.

The real meaning of the difficulties in using a sample census when estimates for several and rather small administrative units are desired is best seen in Table 7 by observing the figures for Montenegro. The total number of holdings in this republic is around 50,000. Here a sample of 9,608 holdings is needed if the coefficient of variation of none of the estimates concerned is to exceed 2 percent. In other words, almost every fifth holding should be included in the sample. In practice, with a more realistic design, requirements may be even more exacting. Taking into account all the aspects of preparations needed for taking the sample survey, it is obvious that little would be gained by a sample census. In other words, an increase in the size of the sample can very easily eliminate the difference between a sample census and a complete enumeration census with respect to the funds and other resources needed.

These aspects of sample censuses have been emphasized to point out those elements which can easily make sample censuses almost as complicated an undertaking as the complete enumeration census. For this reason, these elements require special consideration in each particular case of preparations for sample censuses. Certain straightforward possibilities do, however, exist for reducing these difficulties. The first is to place less heavy requirements on the desired precision for regional or provincial estimates. According to the relationships shown in Table 5, a change of the coefficient of variation in Table 7 from 2 to 5 percent for separate estimates in republics would considerably decrease the total size of the sample and make the idea of sample censuses more attractive. If it were equally possible from the point of view of the use of data, the precision of regional estimates could be decreased even more, making for a further reduction in the obstacles confronting sample censuses.

The second expedient is to accept less precise estimates for highly variable characteristics or those representing only small proportions. As can be seen in Table 7, the main difficulty with the size of the sample in the example is due to the items for sheep and horses, since stock raising for the latter is carried out only in certain provinces and they thus constitute variables with a high variation as compared with the country as a whole. If lower precision can be allowed in respect of highly variable characteristics, an additional advantage can be secured. In the preceding point we saw that the overall burden of a sample census could be eased if we were to be less exacting with regard to regional estimates; here we make a similar concession in the matter of certain items involved.

Further possibilities of reducing the burden of sample censuses are found in the use of more efficient methods of estimation. We mention this point only in passing, as it is basic to the theory of sample surveys and needs no further stress here.
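The saving obtained by relaxing the precision of regional estimates from 2 to 5 percent can be gauged roughly. The sketch below assumes, as in the simple random sampling theory used for Tables 5 and 7, that the required sample size is inversely proportional to the square of the coefficient of variation; the function name `rescale` is introduced here, and the figure of 111,883 holdings is the total quoted in the text.

```python
def rescale(n, old_cv, new_cv):
    """Sample size needed at new_cv, given the size n needed at old_cv,
    assuming n is inversely proportional to the square of the cv."""
    return n * (old_cv / new_cv) ** 2


total_at_2_percent = 111883  # total quoted in the text for Table 7
total_at_5_percent = rescale(total_at_2_percent, 0.02, 0.05)

print(f"total at 2 percent: {total_at_2_percent}")
print(f"total at 5 percent: {round(total_at_5_percent)}")
print(f"reduction factor:   {total_at_2_percent / total_at_5_percent:.2f}")
```

Under this assumption the total falls to about one sixth of its former size, since (2/5)² = 0.16.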
Several possibilities are thus open for making sample censuses considerably less difficult than the alternative complete enumeration and feasible in a number of cases where a complete count for one reason or another cannot be undertaken.

Sampling units

If a decision is taken in favor of a sample census for any of the reasons referred to earlier, and if, in addition to this, the social background prerequisites nevertheless permit a complete enumeration census to be taken easily, a number of problems might arise in censuses of agriculture in connection with sampling units. It may be best to use the holding as the sampling unit and later adopt any design based on this unit, provided it satisfies the usual requirements, such as administrative convenience, precision and economy. If large biases in basic items on the census program are not expected, the use of holdings as sampling units will provide an extremely useful picture of the agriculture. It would also facilitate the preparation of useful classifications and cross-classifications. A single survey, taken at a convenient period of the year, may be sufficient to yield the necessary estimates. In such surveys the information is collected by interviewing the individual holders. The approach is relatively easy because it does not require special knowledge on the part of enumerators. In addition, this method is generally considered to be relatively inexpensive because all the information needed is obtained in a comparatively short interview.

If the conditions are such that holdings can be used as sampling units in the sense described here, a sample census of agriculture would be a rather simple undertaking. Such conditions can be expected in countries with a relatively long tradition in statistics. In sample censuses taken under more difficult conditions, the use of holdings as sampling units might lead to some difficulties.
The first drawback in adopting the holding as the sampling unit in such a situation is that data obtained by interviewing the holders may contain errors due to several causes, such as ignorance on the part of the agricultural population of the right answer to census questions, or the fear of consequences which, it is imagined, will be more detrimental if correct information is disclosed. In an interview survey, a large number of psychological factors enter into the picture, and since their effects cannot easily be brought under control it is essential to examine the situation of the country from the point of view of the quality of data before deciding to rely on holdings as the source of information. If this point is neglected, there will be a risk of inaccuracy in the final results.

The following is an important aspect of the same basic difficulty. It concerns the fragmentation of the total area of holdings, which assumes considerable proportions in some parts of the world. As a result, a number of fields owned or operated by specified holdings are located in administrative units other than their headquarters. Experience has made it sufficiently clear that such fields are more frequently forgotten in holders' statements than those within the administrative units to which the headquarters of the reporting holding belongs. The fields located at a further distance are sometimes operated on a different basis from the others, such as sharecropping, and so the respondents find an excuse to neglect them in their statements even if specifically requested. Nor does the existence of these fields lend itself to easy checking. There is a danger, therefore, that sample censuses using holdings as sampling units may result in serious underestimates of the total areas.

There is a further difficulty. Agricultural censuses and, consequently, sample censuses of agriculture which are based on holdings are, in most cases, taken in such a way that urban areas and the holdings belonging to them are left out of the census. Extending census operations over urban areas increases the cost out of all proportion to the additional information obtained. On the other hand, by restricting the enumeration to rural areas and using a holdingwise approach, there is a danger of omitting some areas belonging to holders living in cities and outside the census area.
This is particularly likely to happen with orchards, gardens and areas used for similar purposes, where the land does not have on it a house in which the people working it might live, so that an enumerator canvassing his enumeration district has no idea that it might be a separate holding.

Several means have been used in the past to counter this danger of underenumeration. For example, rural areas may be canvassed by the census of agriculture, while the census of population is used for putting certain basic agricultural census questions to the households or similar units in urban areas. Also, listing of households for the purposes of the census of population might be broadened to include some questions which permit separating households that can be considered as agricultural holdings. These are enumerated afterward in the subsequent census of agriculture.
This solution is costly. In addition, it is connected with an unknown quality of information about agricultural activities of urban households. If there is no parallel census of population, a small sample survey might be considered in urban areas. Although this could be a means of estimating the extent of agricultural activities of urban households, it represents a costly approach and considerable effort.

Our next problem may be said to derive from the frame to be used in the sample census. In some countries where a cadastral survey has never been taken and where there is no properly established administrative division, a list of villages may be the only frame available. If the holding has to be used as the elementary unit, these villages have then to be thought of as groups of holdings. In the case of condensed settlements where the houses are close to each other, there will be no difficulty in selecting villages as primary sampling units; each village will be clearly distinguished as a separate unit and for each holding it will be possible to identify unambiguously to what primary unit it belongs. In the case of isolated holdings, spread over a large area and showing no tendency toward grouping, a village as a unit may be only very vaguely defined. If a village is selected in such a case, it will not be known precisely what holdings are involved, and the final list of holdings for purposes of selecting the second-stage sample will be correspondingly influenced by the enumerator's personal judgment and the opinion of his informants. Needless to say, this too opens the way to errors and makes it necessary to consider the situation from this further standpoint before the holding is finally adopted as the unit.

Part of the difficulties with holdings as sampling units are removed if principles of area sampling are applied. In other words, in the first stage of selection, some area units are selected which have identifiable boundaries in the field.
In these units, holdings are listed afterward and used as the second-stage sampling units. Instead of holdings, other units might be used in some cases and for some purposes. Such units are fields in area or yield surveys.

Area sampling has a number of disadvantages. First of all, if this method is applied, a country's agriculture can no longer be viewed in its holdingwise perspective. For example, tables showing utilization of land against size of holdings will not be possible. We must therefore weigh what is sacrificed in this approach against the advantages it otherwise offers.

Some typical difficulties of area sampling might be mentioned here. The application of this approach is greatly facilitated if adequate mapping
material is available for purposes of the delineation of units. In the absence of this, the preparation of sketches in the field will be necessary, which is an expensive and time-consuming operation. Difficulties of this nature in securing a sound frame for the selection of the sample are likely to be particularly pronounced in countries where no cadastral plans or maps are available. If villages are taken as primary sampling units, it has at the same time to be understood that villages are now defined in terms of area. In this case, an identification of the borders of the selected villages may be much more complicated than with holdingwise surveys. Precautionary measures against a confusion of boundaries, such as field visits and descriptions of borderlines between various units, are usually expensive. This means that a recourse to area sampling does not simply remove the difficulties of holdingwise surveys. It often happens that the elimination of one difficulty only creates another.

Area sampling in some cases requires better qualified and more conscientious personnel than do holdingwise surveys. Traveling from one selected area to another, taking measurements of fields and plots under various crops, cutting crops, and threshing and weighing them is monotonous work requiring considerable patience. Interviewing holders is a far more interesting task and the attention is kept vivid by continuous changes of faces, homes and surroundings. The use of inadequately trained and insufficiently conscientious personnel in the former case may easily lead to disastrous consequences, while the influence of the personnel in the case of holdingwise surveys is probably kept within much narrower limits by the existence of the questionnaire and by the presence of respondents.

Finally, there is the question of cost. Area sampling for objective data on certain basic items in agricultural statistics is a relatively expensive method. The work is usually done by teams.
Traveling from one unit to another (and these are often located far from main lines of communication) requires the availability of good cars and drivers acquainted with the terrain. The execution of the work on the spot is long and therefore costly. If the work is to be done properly, it may require several visits to the same spot. For example, crop cutting is performed when the crop is mature and before it is harvested. In addition, this work also requires certain instruments, such as balances, metallic tapes for measuring, etc.

In concluding this section, emphasis is again laid on the methodological flexibility of sample censuses. If the study of a country's conditions
shows that the holding must be definitely abandoned as a sampling unit in collecting data on yields, areas and related items, there is no difficulty at all, with sample censuses, in splitting the whole census operation into two parts, one holdingwise and the other by area survey. Furthermore, there is nothing against taking these two surveys at separate times: the winter season might be preferable for the holding surveys, while the period of crop maturity is selected for area and yield statistics. Such an approach may also derive support from the fact that designs needed for efficient work along these two lines may be very different. Sample censuses are thus open to an extremely wide variety of combinations so that samplers have abundant means at their disposal.

Size of the sample 1

The problem of size of sample is much more complex than in cases discussed in the literature on the theory of sampling, where usually only one variable is assumed. On the basis of the known characteristics of the distribution of this variable, the size of the sample is arrived at by reducing to a minimum the sampling error for a fixed cost, or vice versa. In sample censuses, a large number of variables come into play with distributions that differ in varying degree one from the other. If, in this situation, the same procedure for determining the size of the sample is applied for each item on the program, the result will be that the sample sizes obtained for a given precision will vary over a wide range. Obviously, calculation of the size of the sample for all the items on the census program is a waste of effort. The practical steps remaining to be taken in such cases are to select a certain number of items that are considered as basic to the use of data required and to calculate the size of the sample for these items. If they are nearly the same or vary to only a moderate extent, any convenient number of the order of magnitude found will indicate the size required. The remaining items will be estimated with precisions lower or higher than those items used as standards (i.e., those having the required degree of precision). However, if the sizes of samples determined for basic items do vary considerably, then the largest size has to be adopted if one cannot accept less precise estimates of items leading to large samples.

1 This section is based on S.S. Zarkovich. Some problems of sampling work in underdeveloped countries, a paper presented at the Brussels session of the International Statistical Institute, 1958.
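The practical rule described in this section can be put in the form of a short procedure. The sketch below is only one possible reading of it: the function name `choose_sample_size`, the threshold `spread_tolerance` that decides when sizes "vary to only a moderate extent", and the rounding of the average up to the next hundred as a "convenient number" are all assumptions introduced here and are not taken from the source.

```python
import math


def choose_sample_size(required, spread_tolerance=1.5):
    """Pick one sample size from the sizes required by the basic items.

    required: mapping of item name to the size needed for that item.
    spread_tolerance: hypothetical threshold on largest/smallest below
    which the sizes are treated as nearly the same.
    Returns (size, rule), where rule names the branch that was applied.
    """
    sizes = list(required.values())
    smallest, largest = min(sizes), max(sizes)
    if largest / smallest <= spread_tolerance:
        # Sizes are of the same order: take a convenient round figure,
        # here the average rounded up to the next hundred.
        average = sum(sizes) / len(sizes)
        return math.ceil(average / 100) * 100, "similar"
    # Sizes differ considerably: the largest has to be adopted unless
    # less precise estimates are accepted for the demanding items.
    return largest, "largest"


# Sizes for a 2 percent precision, as printed in Table 5.
basic_items = {"Horses": 10846, "Cattle": 3088, "Sheep": 12327, "Pigs": 6416}
print(choose_sample_size(basic_items))
```

In the livestock example the required sizes differ by a factor of about four, so the second branch applies and the size needed for sheep governs the whole survey.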
It is possible to introduce further refinements into the determination of the size of the sample, 2 but these need not be discussed here because they are either connected with special types of designs or presuppose the existence of particular facilities.

In what was said above, the size of the sample has been taken as a function of the precision needed. But what degree of precision is, in fact, needed in sample censuses? It is easy to see that this problem has a quite different character in sample censuses than in sample surveys intended to produce information for a precisely defined purpose. Let us suppose that a survey is planned to gather information on the number of families in a certain area who intend to buy a new type of product in a specified period of time. The aim might be to use this information for deciding whether production should be started locally to meet the demand. If the precision obtained is low, a large risk might be attached to an investment in a business which would not yield sufficient profit. If the size of the sample is enlarged in order to reduce such a risk, the information will be costly. The problem of the precision to be imposed is thus the problem of striking a compromise between the risk and the cost of information. 3

In the case we have envisaged, the data required have a clear operational character. In sample censuses this is not so. Data are collected for general information; they have to serve government agencies, research purposes, private interests and whoever else might wish to be informed on the field in question. In such circumstances, it becomes impossible to use an operational approach in the determination of the size of the sample. Even if we assume that data from the sample census will be used for decision taking, it is not known in advance with what items these decisions will be concerned, and what kind of risk is involved.
Thus, the prerequisites for an operational approach of the type discussed above to the determination of the size of the sample are not to be found in sample censuses.

2 Cfr. T. Dalenius. Sampling in Sweden, Stockholm, Almquist and Wiksell, 1957, Chapter 9, p. 195-211.
3 For further details on this approach in determining the size of the sample by taking into account the possible consequences in the use of data, see W. Edwards Deming. Some theory of sampling, New York, Wiley, 1950, and the literature on quality control. See also P. Thionet. Decisions apropos de sondages. Lecture delivered at the Seminaire de recherche operationnelle de l'Institut de statistique de l'Universite de Paris, June 1955.

At this point we come to the need for a number of metastatistical considerations. The desired precision for basic items on the program has to be determined after a general consideration of potential users of data and the nature of their interest. In a highly developed country where so many studies are based on statistical data and where a number of public organizations and government agencies are continuously revising their policies and plans in the light of the most recent statistical information, one should say that, in principle, a higher order of precision is needed. On the other hand, in countries where data showing the true state of affairs are mostly needed for giving an approximate idea of the general order of magnitude of basic totals, proportions or averages, precision requirements would appear to be less exacting. It is difficult to give any practical guidance in this respect or to lay down precision rules for given circumstances. Many elements have to be taken into account, such as prospective needs for data, basic lines of future economic development, etc. An ability to integrate factors of this nature and translate them into a decision as to the precision to be required in a sample census is a matter of practical experience and acquaintance with many aspects of the life of a country. Here statisticians have to rely on the help and advice of experienced people from many walks of life.

One should add, however, that this approach to the problem of the size of the sample as a function of precision is not of universal application. The approach starts from the assumption that any number of units can be used as long as they correspond to the precision needed. If a multistage design is adopted, this is equivalent to an assumption that we are in a position to include in the sample any number of units at all stages of selection. Logically, therefore, we are also assuming that any number of sufficiently trained enumerators can be hired, that maps can be prepared, transport secured, etc.
It is true that all these assumptions may hold in countries where the social background affords a wealth of facilities for conducting sample surveys. In other countries, however, the size of the sample is likely to be determined in advance by the facilities available. In such cases, the sample size cannot result from the precision previously agreed upon and the characteristics of the distributions involved. As a matter of fact, it is a result of an inspection of the conditions prevailing. This is of frequent occurrence where the social background is unfavorable, as the following illustration shows.

In the FAO 1950 World Census of Agriculture, Basutoland decided to take a sample census because the fundamental prerequisites for a complete census were not met. Basically, the census was split into two parts:
the "main survey" (the aim of which was to collect data on agricultural population, machines, tenure of land and areas under various crops) and the "crop survey" (which was designed to give data on yield for crops in areas dealt with in the main survey). A two-stage sample was used, with a number of area units selected in the first stage and holdings in the second. It was found that the respondents were incapable of giving accurate information either on areas or yields and so measurement was found to be necessary. When, finally, it was clear what work the enumerators had to do, only ten Europeans were available as sufficiently qualified enumerators. It was decided to adopt, in principle, as large a sample as possible, but these ten Europeans, as leaders of the enumeration teams, were a factor of primary importance in determining the ultimate size of the sample.

Measurement of areas planned in the main survey could, in fact, be carried out after plowing up to the late stages of crop growth, i.e., over a period of seven months. In addition, by taking into account the average time needed to perform the work planned in individual holdings (interviewing and measuring areas), the volume of work that these ten teams could do was determined. All that remained was to distribute this work over the primary and secondary units to get the final size of the sample. Here it was arranged that the sampling fraction of the primary units be 1/12 and that of the secondary ones 1/5. 4 In other words, the size of the sample was fixed without being in any way a function of a precision established as a goal in advance. In this particular case, there was some freedom in maneuvering, as it was possible to vary the sampling fraction of primary units and adjust the sampling fraction of secondary units accordingly.
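The reasoning in the Basutoland case, where the sample size follows from the work the teams can do rather than from a precision goal, can be sketched as below. Only the ten teams, the seven-month period and the fractions 1/12 and 1/5 come from the text. The number of field days per month, the daily workload of a team and the total number of holdings are assumptions chosen for illustration, as is the simplification that primary units are of roughly equal size.

```python
from fractions import Fraction


def field_capacity(teams, field_days, holdings_per_team_day):
    """Number of holdings the available teams can cover in the time allowed."""
    return teams * field_days * holdings_per_team_day


def second_stage_fraction(capacity, total_holdings, first_stage_fraction):
    """Second-stage fraction that uses up the capacity exactly, given the
    fraction of primary units selected (units of roughly equal size assumed)."""
    return Fraction(capacity) / (total_holdings * first_stage_fraction)


TEAMS = 10                  # from the text: ten team leaders
FIELD_DAYS = 7 * 20         # seven months (text) x 20 field days a month (assumed)
HOLDINGS_PER_TEAM_DAY = 2   # assumed workload per team
TOTAL_HOLDINGS = 168_000    # assumed number of holdings in the country

capacity = field_capacity(TEAMS, FIELD_DAYS, HOLDINGS_PER_TEAM_DAY)
overall = Fraction(1, 12) * Fraction(1, 5)   # fractions reported for Basutoland

print(capacity)   # holdings that can be covered under the assumptions
print(overall)    # overall sampling fraction implied by 1/12 and 1/5
# Varying the first-stage fraction forces a matching change at the second stage.
print(second_stage_fraction(capacity, TOTAL_HOLDINGS, Fraction(1, 12)))
print(second_stage_fraction(capacity, TOTAL_HOLDINGS, Fraction(1, 10)))
```

Under these assumed figures, selecting more primary units leaves fewer holdings per unit, which is the kind of freedom in maneuvering mentioned above.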
In larger countries with difficult transport conditions, however, cases sometimes arise in which one enumerator or team is unable to visit more than one or two primary units in a given period of time. If the number of enumerators is fixed, as it was in Basutoland, and the work must be completed in a short time, as is the case in crop-cutting surveys, the possibility of varying the second-stage sampling fraction may be similarly circumscribed, and the final precision of estimates will then be even more the result of the conditions found.

It will be seen from these examples that the problem of the size of the sample may vary widely in character depending upon the general conditions of the work. In some cases the size of the sample will be the result of the precision desired. Here, full use of the theory will come into play. In other cases the sample size will be determined by the conditions. Here the statistician's main activity will probably concentrate on securing certain basic facilities, such as additional cars (to increase the number of primary sampling units), on attempting to simplify the procedure or cut down the program so that the field staff can cover more units, etc. In the first case, the size of the sample will be part of the broader problem of how to satisfy the existing needs by selecting the most rational method among a number of alternatives. At the other extreme, it will be determined by facilities at hand once all avenues have been explored to extract the maximum from the resources available. It might be useful to be fully aware of these extremes when the problem of sample size is encountered in practice.

4 A.J.A. Douglas and R.K. Tennant. Basutoland Agricultural Survey, 1949-50, Maseru, Government of Basutoland, 1952.

General strategy in preparing sample censuses
It is clear from previous considerations that the aspects of work to be done in sample censuses are many and varied. Enough evidence has been given to show that all of these aspects are to some degree interconnected. The work on sample censuses thus represents an integrated whole. This is often forgotten by administrators responsible for basic decisions in sample censuses as well as by some statisticians responsible for advising administrators on the formulation of general policy. Some observations may therefore be useful on general strategy in preparing sample censuses.

In their basic decisions, the administrators very often subscribe to a philosophy which seems to reduce itself to the assumption that sample censuses are based on a relatively small number of units and therefore do not require long and systematic preparation. In other words, whenever the decision is taken to get data by means of sample censuses there is a tendency to think that enough time is always available.

The practical consequences of such an attitude are obvious. If a decision is taken late as regards the census day, or if the preparations have been neglected as a result of this attitude, there will be insufficient time for a systematic study of the various aspects of the work. A rather common characteristic of the designs accepted under such conditions is the fact that they deviate to a greater or lesser degree from the requirements of the theory. For instance, the sample of agricultural holdings is sometimes selected from old lists which have not been brought up to date; area units are decided on, yet their borders cannot always be identified; in the case of non-response or inaccessibility, the reserve units are used; in the process of estimation, approximate formulas are used without the requisite checking of the magnitude of biases, and so on. Needless to say, the results of such work, when presented to the users of statistical data, become more a matter of hoping for the best than a product of rigorous scientific procedure.

Furthermore, such work is characterized by poor efficiency. Highly efficient work is the result of a comparative study of a number of alternative designs where many factors may be varied, such as the number of stages of selection, the size of units in the different stages, methods of selection and the like. Efficiency considerations may point to the need to concentrate efforts in the field and procure additional transport or train additional field workers, so that more units of a specified stage of selection can be included in the sample. All the speculations as to what can be done in this respect need time, data and, in all likelihood, a certain amount of field testing. In the absence of this kind of work, the resulting efficiency is likely to be low, involving a waste of energy and money.

What is worse is the fact that poor efficiency very often goes hand in hand with low precision in the estimates obtained. Without proper studies and experiments, it is not certain in advance what methods will yield sufficiently accurate data, what items are particularly liable to errors and what precautions have to be taken to eliminate them.

In some circumstances, however, inadequate preparations are unavoidable as, for instance, when prospects for taking a census all along are negligible and unexpectedly funds become available to finance a sample census.
If, in addition, an early census day has to be accepted for some reason, a survey of the kind we have just deplored will perhaps be all that is feasible, and the inadequate preparatory work will be excusable on the grounds that some data, although far from being fully satisfactory, are better than none at all.

The fact is, however, that most sample censuses are far from being such a simple procedure. To see this clearly, it is well to remember the successive phases of the work in the development of the final plan for a sample census. In studying a country's needs, contacts have to be established with the users of census data with a view to determining the program. In reviewing the resources available, such as funds, transport,
mapping material, and enumerators, many complications can arise. Some of the items listed here may require further study and field visits. If maps are available, what kind and size of units can be based on them? Can the borders of all the administrative units be identified in the field in every case? What type of work can the enumerators available be used for? Attempts to answer these questions may well involve a revision of the program.

Furthermore, attention must often be directed toward future needs. If the survey under preparation is intended for exploitation in the country's subsequent statistical activity, it will certainly have different aspects than if prepared for use on a single occasion. In the former case, it might be very reasonable to allot funds for the preparation of sketches, description of borders of the enumeration districts and other units, studying the components of the total variation and determining the optimum size of various units, equalizing strata, and training the field personnel for more complex duties. Again, the need for further studies and improvements may also lead to an acceptance of a design which is not the best from the efficiency point of view but is convenient as a source of data for various analyses which are to follow the sample census.

What units should be used among those which are theoretically possible? It has been shown earlier that this question may lead to lengthy studies which often cannot be completed within the framework of the preparations for a sample census, no matter how early the work starts.

It is clear from these and earlier remarks that the preparations for sample censuses represent a serious problem, particularly in statistically less developed countries, where the work has to start from the beginning. By way of illustration, let us take an example from the work of FAO. In 1957, an FAO expert went to Iran to assist in the preparations for the 1960 World Census of Agriculture.
It was evident, however, on several counts, that a complete enumeration was out of the question. Work was then concentrated on exploring the possibilities of taking a sample census at least over a large area which accounts for a major part of agricultural production. The only frame available was the list of villages. If interviewing had then to be resorted to in collecting holdingwise data, it would involve the preparation of lists of holdings in the selected villages. This was found to be an easy and inexpensive task which made it possible to assign to each enumerator several primary units. This approach was convenient, therefore, from the cost point of view. There were several reasons, however, for questioning the applicability of this method, such
as the suspicion among holders as to the implication of statistics, their reluctance to disclose accurate information, and the ignorance on the part of the agricultural population of units of area, volume and weight. For a number of such reasons, it was thought that area sampling might be preferable for at least area and yield data. Since no maps whatsoever were available to delineate sampling units, the villages had to be considered as the primary units. But here a difficulty arose. The use of villages in an area approach was found difficult because their boundaries were not clearly defined. In this way, if area sampling was to be considered a practical method, it was necessary to check first whether the list of villages could be used as a list of area units. If so, could sketches of the selected villages be prepared at reasonable cost to delineate identifiable second-stage units?

In a situation of this nature a considerable amount of experimental work in the field is needed before any decision can be taken as to the design of the sample census. An objective answer to these questions was sought by taking a pilot survey in a relatively small and homogeneous region. After the results of this survey were analyzed and the first guide obtained for a more concrete consideration of the techniques that could be applied, a plan was drawn up to extend the experimentation to other regions where conditions are different, so that the use of somewhat changed techniques might be required. Under the plan as prepared, several years were reserved for this experimentation, so that it would be possible to take a sample census of agriculture around 1960, using methods suitable to the conditions of the country.

Clearly, if present-day mastery of the theory and practice of sample surveys is to be used to the fullest advantage in a sample census, systematic action must be initiated as early as possible.
Only under these conditions will budgeting be planned as it should be, and the various studies distributed properly over time and space, the necessary personnel gradually trained, etc. And in this respect there is no basic difference between complete enumeration censuses and sample censuses.
4. AUXILIARY SAMPLE CENSUSES
Definition

A useful approach in census work might in some cases consist in taking a complete enumeration census of a part of the population judged to be more important and a sample census of the remainder. Such a sample census will be called in our terminology an "auxiliary sample census." 1 Its primary aim is to extend the information made available through complete enumeration. The justification for this procedure is found in the need to collect data on those units that may have to be left out of the complete enumeration with the aim of achieving a reduction in cost and effort.

This combination of techniques can be usefully applied in censuses of agriculture if information is required on those holdings which lie outside the accepted definition of an agricultural holding and, for that reason, are left out of the census. In all agricultural censuses there is a limit to the area or the value of production below which a holding or a farm will be omitted from the enumeration. This is to eliminate waste of time and money on enumerating and processing the questionnaires for the units that contribute but little to the totals of most census characteristics. In the 1951 Census of Agriculture in Canada, only those holdings were enumerated which had an area of 3 acres or more, or which, if less, had a total value of their 1950 agricultural product amounting to $200.00 or more. The holdings below 1 acre were left out of the enumeration, irrespective of the value of their production. Needless to say, these limits vary from one country to another according to the prevailing type of agriculture and the distribution of farms by size in terms of production.
1 All sampling methods in census work are referred to in this book by the purpose they serve and not by the name of the technique applied in the terminology of the theory of sample surveys. The combination of a complete enumeration census and a sample census is a special case of stratified sampling. However, the purpose of the sample census in this case is to give information auxiliary to what is obtained by complete enumeration.

Holdings which lie below the limits prescribed can be disregarded only if it is found that the contribution of such holdings to census totals for all the important census characteristics is really negligible. In some cases, however, it may not be possible to disregard them altogether as, for example, where the number of these holdings is large. Although they are individually small, their totals may represent a significant quantity. Elsewhere, such small holdings might have important social and political implications because the number of people living there may be considerable. Leaving such holdings entirely out of the enumeration means depriving the public of the necessary information for studying this part of the population and, conceivably, planning for the improvement of their situation.

Securing this information by means of a complete enumeration census would necessitate changing the definition of the holding in such a way that practically all small holdings are included. If they are very numerous, this solution would only augment the number of units to be handled in the census, with a corresponding increase of cost. In many respects, this cost may be simply a waste of money. The answers of such holdings to most census items will be "nil" and this is the information that absorbs a good part of the total cost for processing and interviewing.

In this situation, a more rational approach would be to confine the complete enumeration census to holdings which constitute a substantial part of the country's agricultural production and take an auxiliary sample census of the small holdings. A particular justification for doing so may be suggested by the fact that where such holdings are concerned, we are normally interested in data on certain items only, such as the number of persons, their age and sex distribution, area operated, and so on. Applying a large questionnaire and obtaining only a limited number of answers different from "nil" would be of very doubtful value.

A notable application of the technique is found in cases where it is necessary to reduce costs.
For example, a complete enumeration could be taken of large holdings such as commercial farms or state co-operatives, and an auxiliary sample census of the smaller holdings. Needless to say, the technique is particularly efficient in reducing census costs if the part of the population being sampled is large compared with the part enumerated completely.

The type of size distribution which leads to an efficient application of the auxiliary sample censuses is presented in Figure 1. The full line in this figure represents the frequency curve of the size of holdings expressed in terms of the total area. It shows a large number of small holdings and a relatively small number of larger holdings. The dotted line represents the cumulative total areas. If
we now assume that holdings of a size larger than T are enumerated completely and those smaller than T only sampled, it will be seen from the figure that the volume of the complete enumeration would be relatively small as compared to the size of operations which would be necessary had all the holdings larger than L been enumerated completely. It will further be seen that the complete enumeration of holdings larger than T covers a substantial part of the total area. Only the part to the left of T under the cumulative curve remains to be estimated by the auxiliary sample census.
FIGURE 1. - Size distribution of agricultural holdings leading to a substantial saving with auxiliary sample censuses. [Figure not reproduced. Horizontal axis: size of holdings, with the limits L and T marked. Vertical axis: number of holdings.]
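The point made by the figure can be checked numerically for any list of holding sizes. The following is a minimal sketch with a hypothetical, strongly skewed set of sizes; the function name, the data and the limit chosen for T are illustrative only, and holdings exactly at the limit are counted with the completely enumerated part.

```python
def coverage_above_limit(sizes, limit):
    """Share of holdings, and share of total area, at or above `limit`."""
    large = [s for s in sizes if s >= limit]
    return len(large) / len(sizes), sum(large) / sum(sizes)


# Hypothetical, strongly skewed size distribution (areas in hectares):
# many very small holdings and a few large ones.
SIZES = [0.5] * 50 + [1] * 30 + [5] * 15 + [40] * 5

share_of_holdings, share_of_area = coverage_above_limit(SIZES, limit=5)
print(round(share_of_holdings, 3), round(share_of_area, 3))
```

In this made-up case a complete enumeration of one fifth of the holdings would cover more than four fifths of the area, leaving the rest to the auxiliary sample census.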
The estimating procedure in the case of the combination of a complete enumeration census and an auxiliary sample census is simple. The population to be enumerated will be taken as composed of two strata, one enumerated completely and the other sampled. The variable under consideration is $x$, which can be the value of any characteristic on the census program. In the first stratum the complete enumeration gives the total $X_1$. The total of the second stratum is estimated and added to the total for the first stratum to arrive at an estimate of the total for the two strata combined. If the latter total is designated by $X$ and its estimate by $X'$, the assumption of simple random sampling leads to

$$X' = X_1 + \frac{N_2}{n_2} \sum_{i=1}^{n_2} x_{2i} \tag{4.1}$$

where $N_2$ stands for the number of units in the sampled stratum, $n_2$ for the size of the sample used, and $x_{2i}$ for the value of the characteristic of the $i$-th unit in this stratum. With the selection of units with replacement, the variance of $X'$ is

$$V(X') = \frac{N_2^2 \sigma_2^2}{n_2} \tag{4.2}$$

where $\sigma_2^2$ stands for the variance of the variable $x$ in the second stratum.

Needless to say, the estimate (4.1) can be simplified for computation if the sampling fraction is fixed at some convenient number, such as 10 or 20 percent. In that case, sample totals have to be multiplied by 10 or 5 respectively and added to $X_1$. In addition, a number of more efficient methods of estimation can be used instead of (4.1). Examples of such cases will be presented in subsequent chapters.

Illustrations
An example of the use of the auxiliary sample census is found in the 1952 Census of Agriculture in Ceylon, 2 which had to be taken under pressure of shortage of funds. This was the first restriction in planning the census. In addition, data were required on the geographic distribution of large farms, called estates because these grow primarily commercial crops, which are an important source of the country's foreign currency revenue. In other words, a complete enumeration census of estates was needed, while the totals for the remaining small holdings could be estimated from a sample census to reduce the total cost.

The combination of the two methods was planned by making use of lists of estates which were available centrally. In principle, in these lists all farms of 20 acres 3 and over were included. Before they were used, however, they were revised in the field with a view to bringing them up to date. After this revision, a complete enumeration census was taken by means of a questionnaire sent to each estate on the list. Only a small number of returns were received within the 21 days originally planned for receipt of the questionnaires, so that several reminders were necessary in order to increase the response rate. For the purpose of the auxiliary sample census the country was divided into clusters, and a sample of 504 such clusters was selected. Within the selected clusters, all the holdings other than estates were enumerated where the holders were resident. The tabulation was then prepared separately for each of these two parts of the population.

2 All the data presented here are taken from Census of Agriculture, 1952, Part 4, Department of Census and Statistics, Ceylon Government Press, 1956.
3 1 acre = 0.4047 hectare.

In order to explain the potential gains in applying this technique, some details will be given about the two parts of the population. According to the 1946 Census of Agriculture, the total area under cultivation in Ceylon was 3,502,564 acres, of which 1,343,697 acres belonged to 11,288 estates and 1,866,404 acres to small holdings. The total area under four principal crops was divided among the two classes of holdings as in Table 8. The average size of the small holdings was 2.49 acres but the publication quoted does not show the total number. However, if the average size of these holdings is 2.49 acres and the total area cultivated is 1,866,404 acres, the total number of holdings should be about 747,000. Their size distribution is presented in Table 9.

TABLE 8. - DISTRIBUTION OF AREA UNDER FOUR MAIN CROPS ON ESTATES AND SMALL HOLDINGS IN CEYLON
(1952 Census of Agriculture)

                 Area cultivated (acres)
Crop             Estates      Small holdings
Tea              456,133         111,155
Rubber           337,374         318,127
Coconut          238,744         832,198
Paddy             28,871         878,560
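The way a complete count of estates and a sample of small holdings are combined into one total can be shown by a minimal sketch. It assumes the usual expansion estimator for a design with one stratum enumerated completely and one sampled by simple random sampling, namely X' = X1 + (N2/n2) times the sample total, with variance N2 squared times the stratum variance divided by n2 under selection with replacement. The function name and all figures in the example are hypothetical and are not Ceylon data; the stratum variance is replaced by its sample estimate.

```python
import statistics


def auxiliary_census_total(x1_total, sample_values, n2_population):
    """Estimate of the overall total and of its variance: complete count in
    stratum 1 plus an expanded simple random sample from stratum 2."""
    n2 = len(sample_values)
    expansion = n2_population / n2              # inverse of the sampling fraction
    estimate = x1_total + expansion * sum(sample_values)
    s2 = statistics.variance(sample_values)     # sample estimate of the stratum variance
    variance = n2_population ** 2 * s2 / n2     # with-replacement formula
    return estimate, variance


# Hypothetical figures: 1,000 acres counted on estates, and a 10 percent
# sample (4 out of 40 small holdings) reporting these areas in acres.
estimate, variance = auxiliary_census_total(1000, [2, 4, 6, 8], 40)
print(estimate, round(variance ** 0.5, 1))
```

With a 10 percent sampling fraction the sample total is simply multiplied by 10 before being added to the complete count, which is the computational shortcut available when the fraction is a round number.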
If all these holdings were enumerated, we would have an example of a census in which the definition of the holdings was such that even the smallest holdings were enumerated. What this means in terms of the volume of census operations can be seen from Table 9.

TABLE 9. - DISTRIBUTION OF SMALL HOLDINGS BY SIZE
(1952 Census of Agriculture, Ceylon)

                    Number of small holdings
Size (acres)        Percentage     Thousand
Less than 0.5          20.25          151
0.5-1                  18.55          139
1-2.5                  32.44          242
2.5-5                  16.65          124
5-10                    8.37           63
10-20                   2.92           22
20 and over             0.82            6
Total                 100.00          747

Interest in
holdings of less than 0.5 acre contributed 151,000 units to the census. If the definition were such that holdings of below 1 acre in size were disregarded, there would be 290,000 fewer units. If the cost of including a unit in the census is the same irrespective of its size, their inclusion makes a tremendous difference in many respects and particularly in terms of cost. In other words, a decision to place the lower limit of the size of the holding at 1 acre means that the census as a whole would be considerably easier to conduct and, at the same time, probably not more than 150,000 acres, or little more than 4 percent of the total cultivated area, would be omitted. In some cases it might be worthwhile to devote serious thought to the gains that could thus be realized through the use of auxiliary sample censuses, which are designed to concentrate complete enumeration on large units without losing data on small ones.

Another illustration refers to the 1950 Census of Agriculture in Brazil. Table 10 shows the country's position as regards the feasibility of an auxiliary sample census. Columns 2 and 3 of this table show the contributions to the country's totals for specified items of agricultural holdings with a total area of less than 10 hectares. Columns 4 and 5 present the same picture for holdings of less than 5 hectares. 4 Holdings smaller than 10 hectares contribute relatively little for most items except the farm population. If the country's needs for data are such that they could be satisfied by taking a complete enumeration census of holdings of 10 hectares and over, and a sample census of smaller holdings, the saving achieved might be substantial as regards both the cost and the amount of organization and effort required.

4 1 hectare = 2.471 acres.

TABLE 10. - RELATIVE CONTRIBUTIONS OF SMALL HOLDINGS TO TOTALS FOR SELECTED ITEMS TABULATED FROM THE 1950 CENSUS OF AGRICULTURE 1

                                                              Holdings with total area of less than:
                                                              10 hectares              5 hectares
Item                                          Total of all     Absolute  Percentage     Absolute  Percentage
                                                  holdings        value    of total        value    of total

Number of holdings                               2,064,642      710,934        34.4      458,676        22.2

LAND USE
Total area (ha)                                232,211,106    3,025,372         1.3    1,170,569         0.5
Permanent crops                                  4,402,426      203,118         4.6       76,892         1.7
Temporary crops                                 14,692,631    1,500,665        10.2      699,352         4.8
Natural pasture                                 92,659,983      339,301         0.4       90,181         0.1
Artificial pasture                              14,973,060      113,392         0.8       29,792         0.2

VALUE
Total value (thousands of cruzeiros)           155,625,221    8,909,537         5.7    4,191,435         2.7

LABOR FORCE
Total number of persons occupied on holdings     9,751,277    2,193,547        22.5    1,327,002        13.6
Male                                             6,943,916    1,449,239        20.9      866,882        12.5
Female                                           2,807,361      744,308        26.5      460,120        16.4
Paid employees                                   3,729,244      390,111        10.5      213,020         5.7
Sharecroppers                                    1,245,557       47,714         3.8       18,213         1.5

EQUIPMENT
Silos                                                2,667           32         1.2           12         0.4
Trucks                                              24,649        1,504         6.1          728         3.0
Tractors (10 mph and over)                           7,099           39         0.5           14         0.2

LIVESTOCK AND POULTRY
Cattle                                          44,600,159    1,439,011         3.2      678,176         1.5
Sheep                                           13,065,706      875,194         7.5      433,487         3.3
Pigs                                            22,970,814    2,854,332        12.4    1,592,403         6.9
Chickens                                        73,920,274   13,198,853        17.9    7,373,443        10.0

Percentages are of the total for all holdings.
1 This table was reproduced from a report submitted to the Government of Brazil by T.B. Jabine, United States Bureau of the Census, who has kindly arranged to make it available for use in this book.

The amount of saving is a function of the reduction in the number of holdings which remain to be enumerated after this combination of methods has been adopted. If a 10 percent sample of holdings of less than 10 hectares gives precise enough estimates, it means that instead of a complete enumeration of
the whole population, this technique has reduced the number of holdings to be enumerated to 65.6 + 3.4 = 69.0 percent. It certainly does not mean that the use of this technique has reduced the budget to 69 percent of what would be needed for a complete enumeration of all holdings. Part of the total budget is the overhead cost in preparations and is not affected by the use of auxiliary sample censuses.

A situation similar to that described in previous examples can be found in many countries. Table 11 presents, in a somewhat different way, the situation in Ecuador in 1954. Column 1 contains contributions to totals for certain specified items coming from agricultural holdings with an area of less than 1 hectare, column 2 shows the contribution of holdings of 1-5 hectares, and column 3 of those larger than 5 hectares. If holdings of less than 1 hectare in size are sampled and the rest enumerated completely, savings similar to those described above are possible.

TABLE 11. - PERCENTAGE CONTRIBUTION OF HOLDINGS OF DIFFERENT SIZE CLASSES TO COUNTRY TOTALS FOR CERTAIN SPECIFIED CENSUS CHARACTERISTICS IN ECUADOR, 1954 1
                        Size of holdings (in hectares)
Item                       Less than 1           1-5    5 and over

Number of holdings                26.8          46.3          26.9
Total area                         0.8           6.4          92.8
Cultivated area                    2.1          15.6          82.3

Production of:
  Maize                            8.1          37.9          54.0
  Beans                           10.4          44.5          45.1
  Barley                           4.8          38.1          57.1
  Wheat                            1.5          22.8          75.7
  Potatoes                         2.9          21.8          75.3
  Winter rice                      3.8          20.8          75.4
  Summer rice                      3.4          28.5          68.1
  Bananas                          0.5          10.0          89.5
  Plantains                        0.8          10.5          88.7
  Cocoa                            0.2           3.8          96.0
  Coffee                           0.7          12.6          86.7

1 Data presented in this table are based on figures published in Primer Censo Agropecuario Nacional, 1954, Republica del Ecuador, Quito, Ministerio de Economia, 1956. This publication was prepared and the corresponding census taken with the assistance of the FAO expert Dr. P. C. Tang.
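The arithmetic behind the Brazilian figure of 69.0 percent can be written out in a few lines. The sketch below is only an illustration: the function name `enumerated_share` is our own, and the 10 percent sampling rate applied to Ecuador is an assumption carried over from the Brazilian example, not a figure given for that country.

```python
def enumerated_share(small_share_pct, sampling_fraction):
    """Percentage of all holdings that still have to be enumerated when
    the small holdings are sampled and the rest enumerated completely."""
    large_share_pct = 100.0 - small_share_pct
    return large_share_pct + small_share_pct * sampling_fraction


# Brazil, Table 10: 34.4 percent of holdings are below 10 hectares.
brazil = enumerated_share(34.4, 0.10)

# Ecuador, Table 11: 26.8 percent of holdings are below 1 hectare.
# The 10 percent rate is assumed here for illustration only.
ecuador = enumerated_share(26.8, 0.10)

print(f"Brazil:  {brazil:.1f} percent of holdings enumerated")
print(f"Ecuador: {ecuador:.1f} percent of holdings enumerated")
```

These percentages refer to the number of units to be enumerated and not to the budget, since the overhead cost of preparations is unaffected by the sampling.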
Auxiliary sample censuses may prove even more useful in other fields, such as in censuses of business or manufacturing establishments, where often a small number of big firms will account for the basic part of the production or sales. Table 12 represents a typical example (data on
TABLE 12. - NUMBER OF SAWMILLS AND THE PRODUCTION OF LUMBER BY VARIOUS PRODUCTION SIZE GROUPS 1

Production  Annual production   Number of   Total production      Average     Standard
group       (M bd ft)               mills           of group   production    deviation
                                               ............... M bd ft ...............
1           5 000 and over            538          5 934 000     11 029.7        9 000
2           1 000-4 999             4 756          8 464 000      1 779.6        1 200
3           Under 1 000            30 964          6 311 000        203.8          300

Total                              36 258         20 709 000        571.2        1 684

1 With some changes this table was taken from Morris H. Hansen, William N. Hurwitz and William G. Madow. Sample survey methods and theory, New York, Wiley, 1953, Vol. I, p. 205, with the kind permission of the authors and publisher.
lumber production of sawmills in various production size groups). It will be seen from Table 12 that the largest number of sawmills is in group 3, which has a rather small average production. If we assume that sawmills in the first two groups are enumerated completely and that an auxiliary sample census of 10 percent of sawmills is taken in group 3, the estimated total production of all the mills would have the following coefficient of variation expressed in percentages:
$$\frac{30{,}964 \times \dfrac{300}{\sqrt{3{,}096}}}{20{,}709{,}000} \times 100 = 0.8\%$$

In other words, a high precision of the estimated production was obtained by reducing the enumeration to 538 + 4,756 + 3,096 = 8,390 units, which is less than a fourth of the total number of units.

Designing auxiliary sample surveys

The most difficult and probably the most important problem in designing auxiliary sample surveys will be the question of the selection of the sample. The efficiency of this combination of a complete enumeration census and a sample census will depend largely upon the solution found to this problem.

The procedure to be adopted for the selection of the sample depends upon the conditions under which the census is being taken. In the 1952 Census of Agriculture in Ceylon, where the complete enumeration was planned as a mail survey based on available lists of estates, the most convenient design of the sample census probably required a sample of area units. A list of small holdings was not available and had to be prepared in the field in the selected area units. When these lists become available, enumeration of all the small holdings listed can be undertaken, as was done in this census, or of a subsample only, whichever is found to be more efficient from the theoretical point of view, or more convenient. Such an approach may prove to be of great importance if it cannot be assumed that a large number of enumerators, usually employed in complete enumeration censuses, are qualified enough to prepare lists and select a sample therefrom in the prescribed manner.
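Returning to the sawmill figures of Table 12, the coefficient of variation of 0.8 percent can be reproduced with a short sketch. The function name and the optional finite population correction are our own additions; the figure printed in the text corresponds to the calculation without that correction.

```python
import math


def cv_of_total_pct(stratum_size, std_dev, sample_size, grand_total, use_fpc=False):
    """Coefficient of variation (percent) of an estimated grand total when
    only one stratum is sampled and all others are enumerated completely."""
    variance = stratum_size ** 2 * std_dev ** 2 / sample_size
    if use_fpc:
        # Finite population correction; not applied in the book's figure.
        variance *= 1.0 - sample_size / stratum_size
    return 100.0 * math.sqrt(variance) / grand_total


# Table 12, group 3: 30,964 mills with a standard deviation of 300 M bd ft.
n3 = 30_964 // 10  # a 10 percent sample, i.e. 3,096 mills
cv = cv_of_total_pct(30_964, 300, n3, 20_709_000)
cv_fpc = cv_of_total_pct(30_964, 300, n3, 20_709_000, use_fpc=True)

units = 538 + 4_756 + n3
print(f"CV of estimated total: {cv:.1f} percent")
print(f"CV with finite population correction: {cv_fpc:.2f} percent")
print(f"Units enumerated: {units} of 36,258 ({100 * units / 36_258:.1f} percent)")
```

Applying the finite population correction lowers the coefficient of variation slightly, so the figure given in the text is on the cautious side.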
In censuses of manufacturing, lists of all the establishments may be available, kept for purposes of taxation or of social insurance administration. A special listing of units is then not necessary and the sample is selected centrally from existing lists. The available addresses might further be employed for data collection by means of mail questionnaires. Where such expedients are possible they may offer a considerable reduction in costs. In many countries, however, the rate of response to a mail survey is very low. This is why the attempt to get information on estates in Ceylon by such means failed. Also, all the measures for extracting more information, such as successive reminders and field visits, had very limited success. As a result, the census as a whole broke down.

This example is an instructive experience: in a mail survey of agricultural holdings one cannot expect more than a very moderate response. If, for one reason or another, sound statistical techniques cannot be used to deal with missing data, as was the case in Ceylon because of personnel and budgetary limitations, mail surveys should probably be replaced by interviewing on the spot, which might be less attractive from the cost point of view but safer. Another reason for doing so is the fact that accurate lists are rarely to be had. Field visits are then made to bring lists up to date.

Another method of selecting the sample will be by entrusting the task to the enumerators. If a census of population is taken prior to the
census of agriculture, or, if the two are planned to be taken simultaneously and lists of households are prepared throughout the whole country, these lists usually contain some data, such as area owned or operated, which make it possible to separate households that are considered as agricultural holdings on the basis of the definitions adopted. The agricultural census enumerators are then instructed to enumerate all the holdings larger than a specified lower limit of size (complete enumeration census) and every k-th small holding (auxiliary sample census). If the census of agriculture is taken independently of the census of population, the enumerators might be instructed to proceed with their canvassing according to a prescribed plan and list all the agricultural holdings, enumerating those above a specified size and only every k-th of those below that size. The lists resulting from this procedure make it possible later to check the accuracy of listing. They also reveal the total number of small holdings, which may prove a useful item of information in the process of estimation.

With this procedure of a complete listing of holdings, the saving is certainly less considerable than in cases where the existing addresses of large units are being used and listing is restricted to some areas only. With a complete listing, the saving amounts to the cost of interviewing and processing the $N_s(1 - f)$ units which were not enumerated in the stratum of small units, where $N_s$ denotes the number of small units and $f$ the sampling fraction. If this saving is small as compared to the total budget, it might be reasonable to reconsider the chances of taking a complete enumeration census of all the holdings and secure all the advantages resulting therefrom.

Another disadvantage of this method of selecting the sample is the danger of biases that might be introduced into the results if the enumerators deviate from the procedure prescribed.
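The listing rule just described, and the effect of a shortfall in the enumerators' selection, can be sketched as follows. The holding areas, the size limit of 10 and the interval k = 3 are invented for illustration, and treating holdings exactly at the limit as large is likewise an assumption.

```python
def split_and_sample(areas, size_limit, k, start=0):
    """Return (large, small, sample): large holdings are all enumerated,
    small holdings are listed and every k-th one is enumerated."""
    large = [a for a in areas if a >= size_limit]
    small = [a for a in areas if a < size_limit]
    sample = small[start::k]
    return large, small, sample


def estimated_total(large, sample, k):
    """Estimate of the total area that relies on the prescribed interval k only."""
    return sum(large) + k * sum(sample)


# Hypothetical listing of holding areas, in canvassing order.
areas = [12, 0.5, 3, 25, 1, 2, 0.8, 40, 4, 1.5]
large, small, sample = split_and_sample(areas, size_limit=10, k=3)

as_prescribed = estimated_total(large, sample, k=3)

# An enumerator who skips one selected small holding, while the estimate
# still assumes k = 3, understates the contribution of small holdings.
short_sample = sample[:-1]
with_deviation = estimated_total(large, short_sample, k=3)

print("sample of small holdings:", sample)
print("estimate as prescribed:  ", as_prescribed)
print("estimate with deviation: ", with_deviation)
```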
Such deviations occur if the enumerators change the structure of the sample more than is accounted for by chance variation, namely, if they select a significantly smaller sample than was prescribed while the process of estimation takes into account only what was prescribed and not what the enumerators may happen to do. 5

5 These biases might have considerable consequences and seriously decrease the accuracy of final results. This problem is discussed in Chapter 5.

Against these disadvantages it should be pointed out that some version of this method of selecting the sample is the only possible way of meeting needs for data by relatively small administrative units. What
is needed in such cases is both a large size of sample and its spread over all parts of the population to be sampled. These two points are achieved by systematic sampling or a procedure leading to similar effects, so that there may be no efficient alternative to this type of approach.

The precision aspect of estimates obtained from auxiliary sample censuses does not constitute a problem of the same importance as it does in normal sample censuses. The part of the population with which auxiliary sample censuses are called upon to deal contributes relatively little to a country's totals, so that even a large error becomes of comparatively small importance when expressed as a percentage of the country's total. For example, if the part of the population sampled contributes only 5 percent to the total for the country as a whole and this contribution is estimated as having an error of 20 percent, the final estimate of the total for the country as a whole will have an error of 1 percent. In other words, funds and effort should not be sacrificed unduly in increasing the efficiency of auxiliary sample surveys, although, obviously, this becomes increasingly important as the contribution of the population sampled becomes larger or if data are needed for small administrative units.

Another important question which sometimes arises in designing auxiliary sample censuses is the position of the boundary (cut-off point) between the part of the population to be enumerated completely and that to be sampled. The importance of this is clear from illustrations given above. If more units are left to be sampled, this increases the variance of that part and leads to a larger sampling error in the combined total for the two strata together, provided the size of the sample remains the same. The rate of increase of the variance corresponding to the increase of the number of units in the part of the population to be sampled depends in turn on the form of the frequency distribution.
Furthermore, increasing the part to be sampled decreases simultaneously the number of units in the complete enumeration census, which in itself means a reduction in cost. The relative importance of these gains or losses depends on differences in cost of collecting data in the two strata. If one stratum is surveyed by mail and the other by interview, it goes without saying that increasing the size of one stratum and decreasing the other will have quite different consequences than if the cost of obtaining data were assumed to be the same in both strata.

We thus see that a number of factors can be taken into account in deciding on where to place the boundary line between the two parts of
the population. In many cases the solution will already be implicit in the conditions of work. For example, in the 1952 Census of Agriculture in Ceylon this boundary line was determined by the fact that a file of addresses existed for all holdings of 20 acres and over. In other cases the division of the size distribution might be guided by some special interest in a particular class of holdings, say those growing commercial crops. In the example taken from the 1954 Census of Agriculture in Ecuador, such crops are bananas, cocoa and coffee. The nature of this interest may be such that the lower limit has to be placed at 1 hectare because the contribution of smaller holdings to the total production of these crops is negligible. This would be an instance of division of the size distribution on nonstatistical grounds.

Elsewhere, however, the statistician will have a certain degree of freedom in selecting the point of the frequency distribution above which the complete enumeration will be applied. In such cases, the whole problem has to be viewed in the light of optimum stratification, and the problem is to find the boundary between the two strata in such a way that the cost of both the complete enumeration and the sample census is reduced to a minimum for a specified precision of estimates. Unlike the usual situation in sample surveys, when all the strata are sampled, in this particular case one stratum is enumerated completely and the other is sampled. A technique has been developed by T. Dalenius 6 for finding, in such cases, the boundary between the two strata which satisfies optimum requirements. The application of this technique assumes that a certain amount of data on the frequency distribution is available. This may prove a source of difficulty because auxiliary sample censuses are in most cases intended to cover that part of the population about which the information, if available at all, is vague.
6 Tore Dalenius. Sampling in Sweden, Stockholm, Almqvist and Wiksell, 1957, Chapter 7. The reader interested in the various aspects of the theory involved will find in this book a full bibliography on the subject. See also Heinrich Strecker. Moderne Methoden in der Agrarstatistik, Würzburg, Physica-Verlag, 1957, p. 80-95, where the technique of optimum stratification is illustrated by a concrete example.

Another difficulty will be due to the fact that the use of this technique provides a simple solution of the problem only in the case of a single variable. In agricultural censuses, however, many variables are involved and if the technique were applied to all of them, the resulting boundaries for particular items would certainly vary over a wide range. In such cases the computation of the optimum boundary, if data are available, can be made for a few important items. When the results are obtained it will be easy to select the
one considered most convenient. The boundary finally selected may not correspond exactly to the optimum for any of the variables involved but it will, in the long run, certainly satisfy the optimum considerations better than what is otherwise the only alternative, namely, guesswork.
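A crude numerical stand-in for the search for the boundary is sketched below. It is not the technique of Dalenius: it simply tries every candidate cut-off for a single variable, assumes the same cost per unit in both strata, assumes simple random sampling without replacement below the cut-off, and uses an invented, strongly skewed list of holding sizes. All names in it are hypothetical.

```python
import math
import statistics


def units_needed(values, cutoff, target_cv):
    """Units to enumerate (take-all stratum plus sample) so that the
    estimated total reaches the target coefficient of variation.
    target_cv is a fraction, e.g. 0.01 for 1 percent."""
    large = [v for v in values if v >= cutoff]
    small = [v for v in values if v < cutoff]
    n_small = len(small)
    if n_small < 2:
        return len(large) + n_small
    s2 = statistics.variance(small)           # variance within the sampled stratum
    allowed = (target_cv * sum(values)) ** 2  # permitted variance of the total
    # Solve N^2 * S^2 / n - N * S^2 = allowed for the sample size n.
    n = n_small ** 2 * s2 / (allowed + n_small * s2)
    n = min(n_small, max(1, math.ceil(n)))
    return len(large) + n


def best_cutoff(values, candidates, target_cv):
    """Candidate cut-off that minimizes the number of units enumerated."""
    return min(candidates, key=lambda c: units_needed(values, c, target_cv))


# Invented skewed size distribution: many small holdings, few large ones.
sizes = [round(0.5 * 1.06 ** i, 2) for i in range(120)]
candidates = [1, 2, 5, 10, 20, 50, 100, 200]

for c in candidates:
    print(f"cut-off {c:>4}: {units_needed(sizes, c, 0.01)} units")
print("chosen cut-off:", best_cutoff(sizes, candidates, 0.01))
```

With several census items, such a search would be repeated for a few important variables and a convenient boundary chosen among the results, in the spirit of the procedure described above.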
5. BROADENING THE SCOPE OF CENSUS PROGRAMS
The problem

The term "statistical age" has been used to characterize the contemporary period of history. This is not to suggest that statistics is to be considered the most important and most typical activity of modern times; it simply records the fact that today statistics have penetrated every phase of life. In contrast to remote periods of history, our age may truly be said to be a statistical one. Hardly any branch of science is without its application of statistical methods. In some fields statistics has become a basic tool for research. Side by side with this, the increasing role of the modern state in various aspects of social life would be unimaginable without statistics. This is why governments and other public authorities and even private interests are encouraging the extension of statistical activity. Whatever planning is attempted for whatever purpose, statistical data are sought as a basis.

In consequence of this interest, we find an ever-mounting pressure on agencies responsible for collecting data to enlarge their programs. Needless to say, censuses do not escape this trend, being as they are one of the most important sources of statistical information. A few examples will show to what results this pressure can lead. In the 1956 French Census of Agriculture, the program consisted of about 465 questions. In the 1951 Canadian Census, the agricultural schedule comprised nearly 200 items and the additional irrigation schedule another 22 items. In the 1950 United States Census of Agriculture, the basic questionnaire contained more than 330 items, many of which were broken down into several questions. In addition to this, there was a landlord-tenant operations questionnaire, a special agricultural questionnaire, an irrigation questionnaire, a drainage questionnaire and a special questionnaire for producers of cut flowers and flowering or foliage plants.
In the 1950 Agricultural Census in Japan there were 133 questions on the basic schedule and more than 100 questions on the supplementary schedule; each question had a number of subquestions. Namely, for each crop
item, information was requested for area harvested, total yield, quantity sold, and sums received. Thus the total number of questions was much higher than 100 + 133.

The consequences of this may well be enormous, since each additional question means more time spent on each unit with a corresponding increase in expenditure and human effort. These consequences can easily be visualized in a population census where millions of inhabitants are involved, or in censuses of agriculture as we know them today, which have fewer units to enumerate but carry a far more complicated list of items to be dealt with.

It is precisely here that sampling methods come in: the census program is divided into two parts, one to be carried out as a complete enumeration census and the other as a supplementary sample survey. On the program of the former are usually to be found those items which are considered to be the basic census information and which are generally tabulated by small administrative units. In censuses of agriculture, this kind of information will concern the total number of farms and their size, basic data on land tenure, land utilization, farm population, livestock, etc. In censuses of population, on the other hand, such items are age, sex, marital status, and the like. Only those items for which information is not required by small administrative units are normally included in the supplementary sample survey. In censuses of agriculture, the items in this part of the program will concern farm facilities and equipment, expenditure, mortgage debts, taxes, etc. In censuses of population, the items involved are migration, education, income, employment or unemployment, and so on.

It is in this way that data collected by means of the supplementary survey broaden or enlarge the scope of the basic complete enumeration census. It goes without saying that the division of the total program into these two parts depends upon needs and local conditions.
The examples that follow illustrate the solutions of this problem in certain countries.

Some illustrations
The idea of thus dividing the total census program into two parts so that one can be considered as basic census information and the other as supplementary, was in use before the practice began of applying
sampling methods for this purpose. For example, in the 1927 Census of Agriculture in Turkey, the basic census enumeration was carried out at village level, i.e., the estimates of village totals were supplied to enumerators by village authorities. Country totals were then obtained by summing up data for villages. Clearly, however, the method was unable to supply information on holdings and their characteristics, so an additional enumeration of six holdings in each of the 8,000 villages was taken and this yielded information to supplement what was known from the basic census. 1

The first adaptation of this idea on the basis of the modern theory of sample surveys was made in the 1940 United States Census of Population, where the United States Bureau of the Census was faced with the difficulty of how to embark upon a relatively large census program on the one hand and, on the other, carry out the enumeration of more than 130 million people in a short period of a few weeks. The solution adopted was to reduce the size of the complete enumeration census program and include all those items on the program of the supplementary sample survey which permit tabulation by "large cities, states, major geographic regions and the United States." 2 The sample survey program included the following items.

"A. Parentage and mother tongue: place of birth of father, place of birth of mother and mother tongue (language that the person spoke in his home in earliest childhood).
B. Veteran status. (Is this person a veteran, the wife, widow or under-18-year-old child of a veteran?)
C. Social security (Does this person have a Social Security number? Were deductions made from his earnings during 1939 for Federal old-age insurance? - all for persons 14 or over.)
D. Usual occupation, industry, and class of worker (for persons 14 or over).
E. Fertility data (for women who are married, widowed or divorced: Has this woman been married more than once? Age at first marriage and number of children ever born)." 3

1 Cf. 1950 Census of Agriculture, Ankara, Central Statistical Office, 1953 (pamphlet).
2 F.F. Stephan, W.E. Deming and M.H. Hansen. The sampling procedure of the 1940 Population Census, Journal of the American Statistical Association, Vol. 35, 1940, p. 615-630.
3 Ibid.
In the attempt to arrive at a decision as to what should be the program of the sample census, the opinion held in this case was that items "... in which the primary objective is the ascertaining of general relationships, or obtaining information needed only for large geographical areas, for studies of various economic and social relationships, and for recommending courses of action, are ideally suited for sample treatment." 4

4 Philip M. Hauser. The use of sampling in the census, Journal of the American Statistical Association, Vol. 36, 1941, p. 369-375.

The items included can be divided into three classes.

"In the first class are inquiries relating to subjects of greater importance in the past than at present, which were included to provide statistics permitting the continuation of time series. In this class are the inquiries relating to nativity of parents and mother tongue. With the virtual cessation of immigration during the past decade and with the dwindling of the number of foreign-born persons in this country, these questions, in light of the demand for more pressing inquiries, could not be carried on the population schedule for the complete canvass. The collection of information relating to these items on a sample basis, however, will provide data adequate for the continuation of the historical series derived from previous censuses.

"In the second class are the inquiries designed primarily to obtain information for administration or for the formulation of administrative or legislative policy. In this class are inquiries relating to social security coverage and to veterans' status. The Congress of United States, the Social Security Board, the Veterans' Administration, and many other agencies are vitally interested in these inquiries on the population schedule. The facts collected on a sample basis, which otherwise could not have been obtained, will be of considerable importance in determining legislative and administrative policy affecting the welfare of millions of citizens and the expenditures of literally billions of dollars.

"Finally, in the third class are inquiries of general economic and social significance, the primary objective of which may be termed scientific, that is, the determination of broad relationships and the general illumination of problems to which they are addressed. Such inquiries, it may be added, will, in the long run, also have important administrative and legislative import. In this class are the census questions relating to fertility and usual occupation. The purpose of the fertility questions is not to obtain small area statistics but rather to ascertain and measure the important factors associated with fertility differentials. The purpose
of the questions relating to usual occupation is to determine, in general, rather than in specific small localities, the extent to which persons are at work in "distress" occupations. There is no question but that these purposes will be served satisfactorily by the sample data collected." 5

This lengthy citation has been introduced because it shows that selecting the program for the supplementary sample survey involves, besides budgetary considerations, a deep insight into a country's needs and collaboration among institutions if the relative importance of many factors is to be adequately appraised.

A further problem in the use of sampling methods for the purpose of enlarging the scope of a complete enumeration census is that of the selection of units. This is important not only because whatever system is adopted is essentially linked to the magnitude of sampling errors and the possibility of getting biased results but also because the selection of the sample has to be well adapted to the whole census organization. Otherwise the selection of the sample may result in high costs, disturbances in a number of census operations, confusion and even biased samples.

In the case cited, the technique used for selecting the sample was based on the characteristics of the questionnaire customarily employed in the United States Census of Population. In the 1940 census, this questionnaire consisted of a sheet with 40 lines on each side, thus containing space for the data of 80 persons enumerated. Each of these 80 lines was provided with its serial number. Data for each person were entered on a separate line according to a prescribed order, viz., within the family, first the head had to be enumerated, then his wife, their children by order of age, and then the relatives and servants.
5 Ibid.

According to the organization adopted, it was easy to distinguish all the types of units usually met with in censuses, such as individuals, families, dwelling units, enumeration districts and the administrative units of varying rank. By studying available material, however, it was found that in this case the most convenient units would be persons, and a sample of 5 percent of persons would be sufficiently large to guarantee the precision needed for the estimates of most census items. In other words, the size of the sample needed can be obtained if 2 lines are selected out of 40 on each side of the schedule.

An unbiased sample of these 2 lines would be obtained if the enumerators were trained to use the table of random numbers. However, it was
found necessary to eliminate the influence of enumerators in this selection. For this reason, the 2 sample lines on each side of the schedule were selected during the preparations for the census and indicated on the questionnaire in special print. The selection of these sample lines was such that they were supposed to give a random sample of persons enumerated, provided the prescribed order of enumeration was respected. It is also worth mentioning that the system of the selection of the sample was such that the sample could be considered an automatic product of the process of enumeration, in that whenever the serial number of the selected sample lines was reached, a new sample unit was determined. In the field work, every person, irrespective of the line on which he or she appeared, was asked to give the answers on the complete enumeration census program. Only those who happened to fall on sample lines were then asked to give the answers on the additional questions.

In the 1950 United States Census of Population, the idea of broadening the scope of the census was again applied but in a more evolved manner. In this census two samples were used: a 20 percent sample and a 3 1/3 percent sample. These two samples contributed a considerable part of the information deriving from the census. "The items included in the 20 percent sample were those for which summary data were to be tabulated for small areas but not in great detail and for which requirements for purposes of analysis did not require complete counts. Tabulations of sample data were generally made for areas consisting of 2,500 population or more, but cross-tabulations only for much larger areas. The 3 1/3 percent sample was used for population items for which publication was planned only for very large areas, such as the United States, regions, states, or cities with a population of 100,000 or more, with varying degrees of cross-tabulation." 6

6 J. Steinberg and J. Waksberg. Sampling in the 1950 Census of Population and Housing, Washington, D.C., United States Bureau of the Census, 1956. Working Paper No. 4.

In all other respects the technique was basically the same as that used in the census of 1940. The idea of the sample of lines was again applied. Each side of the questionnaire used in this census had 30 lines. To get a 20 percent sample one out of each 5 lines had to be drawn into the sample. The 3 1/3 percent sample was obtained by using information available on the last line representing the 20 percent sample on each sheet. The sample lines were again indicated on the questionnaire by means of special print. In this case persons were also used as sampling units,
although there was a great interest in the household as the unit, offering as it does a number of advantages for this type of work. It was not used, however, owing to difficulties in providing simple instructions to enumerators on how to deal with institutional or similar large quasi-households. 7 For this reason the advantage of using sampling methods for broadening information on households had to be sacrificed.

In the 1960 United States Census of Population and Housing, a further step was made toward a more extensive use of sampling methods. 8 First of all, the number of questions included in the 100 percent count of population was reduced to five items only. They are: (i) relationship to head of household; (ii) sex; (iii) color or race; (iv) month and year of birth; and (v) place of birth. Data on all other questions, among which are those relating to the economic characteristics of the population, were collected from a 25 percent sample of the population. Similarly, in the census of housing only about ten questions were asked for each housing unit, while the rest were put on the supplementary program of two samples, one consisting of 20 percent of the units and the other 5 percent.

In 1960, the sampling unit was changed. In earlier censuses, sampling units were persons. In 1960, it was the housing unit and all its inhabitants. In institutional households, the person was retained as the sampling unit. This change in the units used has brought about a decrease in the efficiency of the sample for many characteristics. For this reason, the earlier sampling fraction of 20 percent was increased to 25 percent in this census. However, the new sampling unit offered larger possibilities in the tabulation of household and family characteristics.

As to the selection of the sample, each fourth housing unit was selected from the lists established in the first canvass of the enumeration district. Letters A, B, C and D were assigned in that order to units listed.
The letter to be used for the first unit listed was established separately for each enumeration district, depending upon the last two digits of the serial number of the enumeration district involved. The unit selected in the sample was always the one designated "A."

Our next illustration concerns the 1945 United States Census of Agriculture, in the preparatory stages of which many requests were received for enlarging the program. The facilities available fell very far short of needs, and certain specific measures were thus necessary to meet at least

7 Ibid.
8 Cf. United States Bureau of the Census. Procedural report on the 1960 Census of Population and Housing, Washington, D.C., 1963. Working Paper No. 16.
some of the increased requirements for statistical information on agriculture. A first step was the regionalization of the questionnaires, and the method adopted in this case consisted in placing certain items of interest to certain regions and not to others on a questionnaire to be used in those regions only (for example, certain crops and fruits, the existence of practices such as hunting and fishing), without burdening the general questionnaire with them. A number of items which could not be treated in this way were put on the program of the supplementary sample survey. They were: automobiles, trucks on farms, electric motors, stationary gasoline engines, age distribution of mules, horses, cattle, sheep, and some other special items, such as the number of cows milked the day before the enumeration, quantity of milk produced, number of hens and pullets of laying age a day before the enumeration, eggs produced that day, milk, chickens and eggs consumed, number of hives and bees kept on the farm, production of honey, etc. In addition, questions were asked on mortgage, on farm work, on work away from the farm, and on certain mechanical facilities, such as refrigeration and power-driven washing.9

The sample used in this program was the so-called master sample 10 and, to prepare this, every county in the United States was subdivided into small area segments (approximately 2.5 square miles) having about five farms each. The sample used consisted of 1 out of every 18 such segments. The total number of segments actually selected was roughly 67,000. By this means, practically every county was represented in the sample. Once this sample of segments had been established, the information on items included in the supplementary survey program was collected from all the farms having their "headquarters" (farm dwellings, farm buildings or farm entrance) within the selected segments. To improve the estimates, a complete enumeration of approximately 50,000 large farms was carried out in addition to this sample. They were enumerated for both the complete enumeration census program and the sample survey program, irrespective of whether their headquarters fell within the selected areas or not. The criteria for selecting these farms

9 Further details on this sample program can be found in United States Bureau of the Census. U.S. Census of Agriculture, 1945, Vol. 2, General report: statistics by subjects, Washington, D.C., U.S. Government Printing Office, 1947.
10 More details on the master sample will be found in A.J. King and R.J. Jessen. The master sample of agriculture, Journal of the American Statistical Association, Vol. 40, 1945, p. 38-56 and R.J. Jessen. The master sample project and its uses in agricultural economics, Journal of Farm Economics, Vol. 29, 1947, p. 531-540.
varied from one region to another. Lists with the addresses of such farms were given to the enumerators before the field work started. The size of the sample for the supplementary information was thus about 1 farm in 14 while, in terms of area and production, it was still larger because of the inclusion of big farms. It should also be pointed out that this sample was designed to give estimates for sample census items only for large regions or states.

Broadening the scope of the census by means of sampling methods was also done in the 1950 Census of Agriculture in Japan, where 181,744 enumerators were appointed to enumerate 6,180,000 farm households. In this situation, the limitations on the scope of the census were serious from many points of view. A serious handicap, in addition to shortage of funds, to putting many items on the regular census program was the need to train enumerators, for it was felt that reasonably accurate answers on a number of questions could be obtained only if the task were entrusted to persons with a good knowledge of agriculture and local conditions.

The supplementary sample survey was taken simultaneously with the complete enumeration census. Its program was mainly composed of items which are generally more complex and require more qualified interviewers. The program of the supplementary survey included: (i) detailed information on each person living in the household selected for the sample survey, and on farm labor in general; (ii) details on the animal and motor power used for operations on holdings during 1949; (iii) data on the use of the total area belonging to the holding; (iv) area harvested, total yield produced, quantity and value of crops marketed for each item in a long list of crops; (v) money borrowed during 1949, with a specification of expenditures thereof and the rate of interest charged; (vi) amount of debts in farm households and the amount of savings in agricultural co-operatives.11
The field work in connection with this sample survey was carried out by the field staff of the Ministry of Agriculture and Forestry. The Crop

11 Cf. 1950 World Census of Agriculture in Japan (Interim Report), Tokyo, Ministry of Agriculture and Forestry, Statistics and Research Division, 1951.
Reporting Office, which operates under this ministry, has branch offices whose employees were used as enumerators.

The sample was selected in the following way: statistical agencies in the prefectural governments were charged with the task of preparing lists of farm households. These lists were then sent to the branches of the Crop Reporting Office, where a serial number was given to all the holdings, excluding those of an institutional character. On the basis of these lists, a 5 percent systematic sample of farm households was selected by choosing at random the starting point and then taking every 20th household. Each employee of the Branch Crop Reporting Office carried out the enumeration in those units which happened to fall within the region under his charge. In other words, the method used was such that the supplementary sample survey was taken by a special group of highly trained enumerators.

In the 1950 Census of Agriculture in Finland, an extremely large program was embarked upon, and several interesting devices were used to put it in operation. In the basic questionnaire, which was obligatory for all the holdings with more than 2 hectares of arable land, the following groups of items were introduced: (i) general data and conditions of tenure; (ii) the lands of the holding and data on their use; (iii) use of the arable land in the summer of 1950; (iv) gardening; (v) establishments for increasing the efficiency of labor; (vi) underground drainage; (vii) data on livestock as of 15 June 1950; (viii) livestock production (milk, eggs, slaughtering); (ix) manure care; (x) agricultural machinery; (xi) quantity of wood sawn; and (xii) fishing. One further questionnaire was obligatory for all the holdings, and this concerned only the manpower used on the holding during the calendar year 1950. In addition, there were special forms on gardening, fishing, handicraft, and fur-bearing animals.
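The 5 percent systematic selection described for the Japanese supplementary survey (random start, then every 20th household on the numbered list) can be sketched as follows. The function name and the toy list of 1,000 serial numbers are illustrative assumptions, not part of the census procedure.

```python
import random

def systematic_sample(serials, interval=20, rng=None):
    """Pick a random starting point among the first `interval` entries,
    then take every `interval`-th entry of the serially numbered list."""
    rng = rng or random.Random()
    start = rng.randrange(interval)   # position 0 .. interval-1 on the list
    return serials[start::interval]

# Hypothetical list of 1,000 serially numbered farm households.
holdings = list(range(1, 1001))
sample = systematic_sample(holdings, interval=20, rng=random.Random(1950))
print(len(sample))   # 50 households, i.e. 5 percent of the list
print(sample[:3])    # first three selected serial numbers
```

Because the list is prepared and numbered before the enumeration, the enumerators know in advance which households belong to the sample.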
Information on these special points was not collected from all the holdings but only from those which happened to meet the criteria established. For example, data on gardening were collected only from the holdings having one or more greenhouses, or more than 0.5 hectare of cultivation of fruit trees. Data on fishing were collected only from those holdings on which fishing was exercised and where the annual catch was 50 kilograms or more. Data on handicraft were asked only if the holding was exercising handicraft for sale, etc. By doing so, an effect was reached similar to the idea of regionalization of the questionnaire already mentioned in connection with the 1945 United States Census of Agriculture. By using the device of regionalization, all the farms within a given region
were enumerated on the basis of the same questionnaire, while in this case, the selection of farms for a particular special form was determined by enumerators on the basis of information obtained.

A further step in enlarging the scope of the census was made by the use of sampling methods. For this purpose, a sample of farms was selected from the lists available before the census started. The sample farms were instructed to keep records on certain of their activities during the period 1949/50, and when the census enumerators appeared, these records were used for preparing answers to corresponding questions. The questionnaire which was used in order to collect additional information from the sample holdings contained the following groups of questions: (i) general data on the holding; (ii) drainage of the arable land; (iii) kind of soil on the arable land; (iv) use of fertilizers, manure and soil improvement materials in 1950; (v) particulars on the household facilities; (vi) hunting; (vii) data on the use of the previous year's harvest (quantity produced, sold, fed to animals, used as seed, in store on 15 June 1950); and (viii) data on livestock production.

An interesting feature of this supplementary sample survey was the attempt to increase the quality of its data by requesting farmers to keep records during the period 1949/50. As is well known, statistical data are likely to contain memory biases if the period of reference is long. In this sample survey, there were a number of items typical of this sort of difficulty. In the group of items on livestock production there were questions such as total production of milk on the farm during the period 1 July 1949 through 30 June 1950, average daily consumption of milk for humans and animals, total annual production of eggs, etc. The answers to these and similar questions are likely to be inaccurate unless particular provisions are taken against the incidence of errors.
The device used in this instance was that of keeping records.

The same sample of holdings was used to obtain additional information on crop production where, for reasons of quality, the questions were not put on the basic questionnaire which was compulsory for all the holdings. The census was taken in the summer of 1950 and it was felt that data on crop production in 1949 might be inaccurate. This is particularly so in the case of holdings which are not specialized and produce a large number of crops of which part is sold and part used on the farm. To avoid the danger of inaccuracy that might have arisen on this count, another device was used that could also be of interest from the methodological point of view. When the enumerators contacted
the sample holdings for information on items contained in the basic questionnaire and the above-mentioned sample survey program, they left at each of these holdings a provisional crop production questionnaire. For each crop listed on this form, data were required on area harvested, total yield, the part of the harvest sold or to be sold, and yield per hectare. Farmers were instructed to fill in this questionnaire with the corresponding data as soon as the particular operation was finished and the desired answer became known. By doing so, complete information on crop production gradually became available and the enumerators were instructed to visit the sample farms again at the end of 1950 and fill in the real crop production questionnaire on the basis of the information on questionnaires left with the farmers.

Another example illustrating the use of sampling methods in broadening the scope of the census is to be found in the 1956 Census of Agriculture in France, where a very interesting technique was used.12 On the program of this census there were approximately 465 items. Obviously, it would be difficult to ask each farmer to answer so many questions. Therefore, about 135 items were selected to represent the program of the complete enumeration census, while the remainder was left for the sample survey program. It was also discovered that an additional program of about 330 questions would be too heavy a burden for any individual farmer. Prolonged questioning is likely to produce fatigue and, consequently, inaccurate results. Similarly, if the questionnaire is long and time-consuming, the farmers will be less disposed to collaborate. In order to avoid difficulties that might arise, it was decided to distribute the total additional program among all the farmers according to the following technique. Out of the total population of holdings, 10 independent random samples were drawn, each of which had a size of 1/10 of the population.
In other words, every farm was an element of one of these 10 samples. Next, the total sample survey program was divided into ten approximately equal parts, and each of these parts was associated with the complete enumeration program. By this means, 10 different supplementary survey programs were obtained, and each of them was printed on a separate questionnaire. All the questionnaires were equal with regard to the complete enumeration census program and were different for the sample survey program. The final aim of dividing the total burden of the supplementary program among all the farmers was then achieved by using one type of the questionnaire for all the units within a given sample. For tabulation purposes, questions common to all the questionnaires were treated separately, as in any other complete enumeration census. As far as the additional program was concerned, data on each questionnaire of a given type provided the basis for deriving estimates of corresponding items on the supplementary program.

12 Cf. M. Desbrousses and M. Contan. Le recensement général de l'agriculture, Revue du Ministère de l'agriculture, No. 128, 1957, p. 104-107.

Needless to say, it was not possible to use an equal number of questions from the additional program on each questionnaire because these questions had to be grouped according to subject matter. On the other hand, some questions from the sample survey program were, in fact, repeated on several questionnaires. This was necessary because of cross-classification of items belonging to different subject-matter groups. For this reason, the distribution of sample survey questions on individual questionnaires has to be carefully adapted to the type of tables that one wishes to draw up from the census.

Another problem with this technique is the selection of the sample. In point of fact, each of the 10 samples used was supposed to be representative of the population, being selected from the files of farms. The work of creating such a file started in 1949 for certain nonstatistical purposes and ended in 1953 before completion. In 1955, however, the statistical service started improving these files. For each farm, a short questionnaire was filled in and all the questionnaires for a commune were sent to provincial statistical offices. In these offices, all the communes under their competence were first classified according to agricultural regions and then within each region into two size classes: those over 50 hectares and those under 50 hectares. Within these two size classes the farms were further grouped by communes and within each commune by size in decreasing order.
Then, within each of these two size classes all the holdings were given serial numbers by letting the largest holding in the first commune be number 1 and carrying through the numeration over all the holdings in the size class within the region. Then, all the holdings having a serial number ending in 1 were taken as elements of the first sample and were enumerated on questionnaire no. 1. The holdings with serial numbers ending in 2 constituted the next sample and were enumerated on questionnaire no. 2, etc. It is seen that there was no difference between the holdings under 50 and over 50 hectares as to the composition of the sample. The real reason for creating these two size classes was that more complete information was needed on large holdings.
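The French numbering scheme lends itself to a short sketch. The code below assumes a list of holdings for one region and one size class, each given as a hypothetical (commune, area in hectares) pair; it orders them by commune and by decreasing size, numbers them serially, and assigns a questionnaire by the last digit of the serial number. Two points are assumptions made for the illustration: communes are taken in sorted order of their label, and serial numbers ending in 0 are assigned to questionnaire no. 10.

```python
from collections import defaultdict

def order_holdings(holdings):
    """Group by commune, then by size in decreasing order within each commune."""
    return sorted(holdings, key=lambda h: (h[0], -h[1]))

def assign_questionnaires(ordered):
    """Map questionnaire number (1..10) to the (serial, holding) pairs it covers."""
    samples = defaultdict(list)
    for serial, holding in enumerate(ordered, start=1):
        questionnaire = serial % 10 or 10   # assumption: last digit 0 -> no. 10
        samples[questionnaire].append((serial, holding))
    return samples

# Hypothetical holdings of one size class in one agricultural region.
holdings = [("commune-A", 40.0), ("commune-B", 12.5), ("commune-A", 48.0),
            ("commune-B", 30.0), ("commune-A", 5.0)]
ordered = order_holdings(holdings)
for q, members in sorted(assign_questionnaires(ordered).items()):
    print(q, members)
```

Every holding receives exactly one of the ten questionnaires, so the supplementary burden is spread over the whole population while each questionnaire type is answered by a systematic one-in-ten sample.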
It might be interesting to note in this connection that a similar technique was used in selecting the sample for broadening the scope of the 1950 Census of Housing in the United States.13 The whole program of the supplementary sample survey was divided into five groups and the questions in each of these groups were asked at 20 percent of the housing units. As a result, 5 samples of equal size were established and each of them provided data on different questions. It was found afterward that this sample introduced serious difficulties in the establishment of the cross-tabulation program, as cross-classifications of items on the program of different samples were not possible. This is why it was decided to abandon this technique in the 1960 Census of Housing. In this latter census, a 25 percent sample of housing units was selected and split up into 2 samples, one of 20 and the other of 5 percent. The data collection program of these 2 samples was not identical. Some questions were on both programs; in other words, they were asked at 25 percent of the housing units. Other questions were posed at 20 percent of the units while, finally, the balance of 5 percent was used for questions in the third group. For this reason, cross-classifications between items on the latter two programs were not possible. However, cross-classifications of items in samples of 20 and 25 percent, as well as 5 and 25 percent, were possible.

In the 1954 United States Census of Agriculture, the selection of the sample for the supplementary survey was based on Form A2 (reproduced in Appendix 1). On this form, the enumerators had to enter each household in their district with specified additional data which made it possible to decide whether a particular household had to be considered as a farm and the agricultural questionnaire had to be applied or not.
For all the households which qualified for the agricultural census, the enumerators also had to ask the total area operated and put a cross in column 14 in the cell which corresponded to the size of the holding. By doing so, they created a classification of holdings into five classes: (i) under 30 acres; (ii) 30 to 99 acres; (iii) 100 to 299 acres; (iv) 300 to 999 acres; and (v) 1,000 acres and over. In the columns corresponding to these size classes in Form A2, every fifth cell will be found shaded. If the cross happened to fall on the shaded cell, the holding listed in this line was selected in the sample and the enumerator was instructed to ask

13 For more details see United States Bureau of the Census. Procedural report on the 1960 Census of Population and Housing, Washington, D.C., 1963. Working Paper No. 16.
such farms for data on the supplementary sample survey program. It will also be seen that all the cells for holdings of 1,000 acres and over are shaded. This means that all these are included in the sample. Because of this procedure with large holdings, which was needed for the purpose of improving the precision of estimates, the final size of the sample was larger than 20 percent of all farms. In fact, it amounted to 22.5 percent of all holdings in the United States.14

In the 1959 United States Census of Agriculture, the supplementary program related to "... sales of dairy products and sales of livestock, use of fertilizer and lime, farm expenditures, land-use practices, farm labor, equipment and facilities, rental agreements, farm values and farm mortgage debt." In the selection of the sample the lists of holdings were used. The enumerators had to put the serial number against each holding on the list and then collect supplementary data from each holding for which the assigned number ended in "2" or "7," viz., 2, 7, 12, 17, etc. In addition, the sample also included all holdings with estimated sales in 1959 of more than $100,000, or a total area of 1,000 acres or more.15

Classification of techniques illustrated

The review in the previous section of the techniques used in the application of sampling methods for purposes of broadening the scope of complete enumeration censuses makes it possible to classify the techniques illustrated according to criteria which are of basic importance in methodology. The first criterion will be the method used to select the sample. This criterion leads to two groups, i.e., techniques based on automatic selection of the sample, and techniques based on some special procedure of selection.
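As an illustration of the first group, the 1959 rule described above can be applied line by line while the enumerator is still listing holdings, with no frame prepared in advance. The sketch below is a hedged reading of that rule; the function name, the parameter names and the sample listing are hypothetical.

```python
def in_1959_sample(serial, sales_usd=0.0, area_acres=0.0):
    """Return True if a listed holding falls in the supplementary sample.

    Rule as described in the text: serial number ending in 2 or 7, or
    estimated 1959 sales above $100,000, or total area of 1,000 acres or more.
    """
    ends_in_2_or_7 = serial % 10 in (2, 7)
    is_large = sales_usd > 100_000 or area_acres >= 1_000
    return ends_in_2_or_7 or is_large

# Hypothetical listing: (serial, estimated sales in dollars, area in acres).
listing = [(1, 4_000, 80), (2, 9_500, 160), (3, 150_000, 400),
           (4, 20_000, 1_200), (7, 3_000, 40), (8, 12_000, 90)]
selected = [serial for serial, sales, area in listing
            if in_1959_sample(serial, sales, area)]
print(selected)  # serials 2 and 7 by terminal digit, 3 and 4 as large holdings
```

The terminal-digit part alone yields 2 holdings in 10, i.e. 20 percent; the large holdings are added on top of that fraction.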
A sample is said to be selected automatically when (a) the units belonging to the sample become known gradually as the process of enumeration continues; and (b) no particular action is taken to establish the frame before the enumeration itself starts. Examples of the technique belonging to this class are the samples used in the 1940, 1950 and 1960 United States Censuses of Population, the 1950, 1954

14 Data presented here are taken from United States Bureau of the Census. U.S. Census of Agriculture, 1954, Vol. 3, Special reports, Part 12. Methods and procedures, Washington, D.C., U.S. Government Printing Office, 1956.
15 Further details are found in United States Bureau of the Census. U.S. Census of Agriculture, 1959, Vol. 2, General report: statistics by subjects, Introduction, Washington, D.C., 1962.
and 1959 Censuses of Agriculture in the United States, the 1951 Census of Agriculture in Canada, etc. If, on the other hand, lists of holdings are prepared in a separate stage of field work and the sample is selected from among them so that units belonging to the sample may be established prior to the commencement of the enumeration and the enumerators know in advance which holdings will supply the supplementary information, it is said that such a sample was selected through a special procedure. Examples of this are the 1956 Census of Agriculture in France, the 1950 Census of Agriculture in Japan, the 1945 Census of Agriculture in the United States, etc.

The next criterion concerns the enumerators employed in the supplementary program. Here again we distinguish two groups of techniques. The first group consists of those in which the supplementary information is collected by the regular census enumerators. Examples are the 1940, 1950 and 1960 Censuses of Population in the United States, the 1956 Census of Agriculture in France, etc. Into the second group fall the techniques where special enumerators are trained or additional training is given to a number of regular census enumerators with a view to enabling them to collect data on the supplementary program. An example of this group is the 1950 Census of Agriculture in Japan.

The results of this classification are given in a summarized form in Table 13. As will be seen, one cell is left blank. The reasons will be explained later.

Let us now examine some of the principles governing the choice of the technique to be used in a concrete case. Clearly, the method of the automatic selection of the sample has advantages over the use of a special procedure from the point of view of cost.
It is sufficient to remember in this respect that the cost of the selection of the sample in the 1940 United States Census of Population was reduced, in practice, to preparing various styles of questionnaires and distributing them in a given order to enumerators. On the other hand, the cost of the preparation of files of farms, as in the case of either the French or the Japanese Censuses of Agriculture, must be considerable. It is not known what funds were needed for this operation but from experience with similar undertakings, one would be safe in saying that the expenditures involved were high. The use of automatic selection of the sample is likely to make this item unnecessary in the census budget.

At this stage, however, one must add that the advantage of automatic selection is accompanied by one very serious drawback, in that the sample is exposed to selection biases of unknown magnitude. If not properly dealt with, these biases can easily cancel out the advantages of this method of selection. The problem of biases, therefore, is crucial to the automatic selection of the sample and deserves our special attention. We therefore turn to this question and take up various aspects of the choice of the method of selection of the sample at a later stage.

TABLE 13. - CLASSIFICATION OF THE TECHNIQUES USED IN BROADENING THE SCOPE OF COMPLETE ENUMERATION CENSUSES THROUGH THE APPLICATION OF SAMPLING METHODS

Enumerators used              Method of selecting the sample
                              Automatic                                           Special procedure

Regular census enumerators    1940 Census of Population in the United States      1945 Census of Agriculture in the United States
                              1950 Census of Population in the United States      1950 Census of Agriculture in Finland
                              1950 Census of Agriculture in the United States     1956 Census of Agriculture in France
                              1951 Census of Agriculture in Canada
                              1954 Census of Agriculture in the United States

Special enumerators                                                               1950 Census of Agriculture in Japan
Selection biases

The first systematic study of the bias problem as it appears in this connection was made during the preparations for the 1940 Census of Population in the United States. This study has, in a sense, come to be regarded as a classic and helps us considerably to grasp the problem of biases. A summary of the results achieved is presented in what follows.

There are several factors in the process of enumeration as carried out in the field that may influence the results obtained from a sample of lines as selected in the above-mentioned 1940 Census of Population. One such factor is a pattern of enumeration which is usually introduced by enumerators when they prepare the order of persons and dwelling units within the districts assigned to them. They do not make a random selection of households or dwelling units to establish that order because such a procedure would involve the existence of a previously prepared list of units and a good deal of walking. Instead, they usually start at a corner and proceed along the street until all the units are exhausted. Another factor leading to similar consequences is the pattern which is introduced if the enumerators start their enumeration in each house from the top floor and proceed to the bottom, or vice versa. Third comes the pattern introduced into the census by the prescribed order of enumeration within the households, by which the head is required to be enumerated first, then his wife, and after her their children, relatives and servants.

All these patterns would be immaterial in the case of an independent random selection of persons for the sample within, say, each enumeration district. Such a selection, however, would not be an automatic one. With the system adopted in the 1940 United States Population Census, the meaning of these patterns was quite different. To make this clear, it will first be assumed that people living in corner houses are different from those living along the rest of the street. In some countries, or at least in some cities, this assumption may be a realistic one, in the sense that apartments in corner houses are more expensive and, consequently, rented by more affluent people. In addition to this, it will be assumed that people living in top floor apartments are different from the others because the top floor apartments are usually more expensive and are again rented by people who can afford the difference in price. To make the illustration more drastic, let us further suppose that a small sample of, say, 1 line out of 80 could satisfy the needs.
If this line happened to be the first on each sheet, the first unit in the sample from each enumeration district would automatically represent the head of a family living in a corner house on the top floor. Irrespective of who the persons are on other lines selected for the sample, it is certain that, after the assumptions made, such a sample would give biased results. Heads of families living in corner houses and top floor apartments would be overrepresented and this may influence the estimates of most of the census characteristics. By increasing the size of the sample, one could certainly reduce the magnitude of the bias present but, with the idea of sampling lines on the questionnaire, one cannot completely eliminate its danger in the case of moderate sample sizes. Under such assumptions, one might expect that samples including line 1 will be different from those which do not include this line. In other words, the system of selection of the sample is such that it is likely to produce biased results under certain conditions. The magnitude of this bias is not known in advance because it depends upon the existence of the patterns we have mentioned. For this reason, a study of the concrete situation is necessary, so as to show: (i) whether there is any bias to be expected as a result of the system of sampling adopted; (ii) if so, what are the sources of these biases; and (iii) what has to be done within the framework agreed upon to eliminate them. If such a study is not undertaken, the results of the sample may be unreliable for most practical purposes.

The first result of such a study conducted by the United States Bureau of the Census 16 concerns the existence of patterns in census results. An inspection of census material has shown that corner houses are different from other houses with respect to certain characteristics in a population census. As a consequence of this, "household characteristics, such as rent, value of home, number of children, number of lodgers, vary from house to house in waves of distinct patterns with peaks or depressions at the corners." Thus, the "corner influence" must be taken into account if the proposed system for the selection of the sample has to be used.

Further, it was also found that enumerators disobeyed various instructions. They were required, for example, to fill in all the lines on each sheet. Some of them neglected this and left a few of the last lines blank when all the members of a family could not be enumerated on the same page. So they started with the head of this family on the next page, increasing in this way the proportion of heads of families on line 1.

The summarized picture of these findings is presented here in Table 14. It can be seen that the first sheets in all enumeration districts start with heads and continue with wives, children, etc., according to the instructions given.
But the last 2 lines on the first sheet in the second district were left blank. The enumerator did not want to separate two children from their parents and started on the second sheet with the head of this family. The final result of such operations is seen in the totals at the right-hand side of the table. It is instructive to see how the totals for particular symbols vary with the serial number of lines.

16 The results presented here are published in: Frederic F. Stephan, W. Edwards Deming, and Morris H. Hansen. The sampling procedure of the 1940 Population Census, Journal of the American Statistical Association, Vol. 35, 1940, p. 615-630, which is recommended to all interested in the subject.
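The mechanism behind these totals can be imitated with a small simulation. The sketch below is not a reconstruction of the Bureau's study: household composition, the number of households per district, and the probability that an enumerator leaves the end of a sheet blank rather than split a family are all invented for the illustration. It only shows that, under such enumeration habits, heads are more frequent on line 1 than among all lines.

```python
import random

LINES_PER_SHEET = 80  # assumed sheet length, matching the 1-in-80 illustration

def enumerate_district(rng, n_households=200, keep_together_prob=0.5):
    """Return the symbols written line by line for one enumeration district."""
    lines = []
    for _ in range(n_households):
        # Prescribed order inside the household: head, wife, then children.
        household = ["H", "W"] + ["C"] * rng.randint(0, 4)
        free = -len(lines) % LINES_PER_SHEET   # lines left on the current sheet
        if 0 < free < len(household) and rng.random() < keep_together_prob:
            lines.extend("B" * free)           # leave the rest of the sheet blank
        lines.extend(household)
    return lines

def head_share(symbols):
    return sum(s == "H" for s in symbols) / len(symbols)

rng = random.Random(1940)
all_lines, first_lines = [], []
for _ in range(94):                            # 94 districts, as in Table 14
    district = enumerate_district(rng)
    all_lines.extend(district)
    first_lines.extend(district[::LINES_PER_SHEET])  # line 1 of every sheet

print(f"share of heads on all lines: {head_share(all_lines):.3f}")
print(f"share of heads on line 1:    {head_share(first_lines):.3f}")
```

A sample that always takes line 1 would therefore overstate every characteristic associated with heads of households.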

TABLE 14. - HYPOTHETICAL ENUMERATION OF A CITY WITH 94 ENUMERATION DISTRICTS SHOWING VARIOUS CHARACTERISTICS BY SHEET AND LINE NUMBER 1
(Data on sheets between the second and last are not shown)

Line   First sheet in district   Second sheet in district   Last sheet in district   Totals for all sheets
no.    1 2 3 ..... 92 93 94      1 2 3 ..... 92 93 94       1 2 3 ..... 92 93 94       H     W     C     O     B

 1     H H H ..... H H H         W H O ..... C C C          C H C ..... W O C        327   208   378    87     0
 2     W W W ..... W W W         C W O ..... C C H          O W C ..... C O C        231   305   384    80     0
 3     C H C ..... C C C         C C H ..... C O W          H C C ..... C O H        252   215   442    90     1
 4     C W C ..... H H C         O C W ..... C O H          W C H ..... O H W        256   227   426    88     3
 5     C C H ..... W W H         H H C ..... O B W          C C W ..... H C C        249   229   431    85     6
 6     O H W ..... C C W         W C C ..... O H C          H C C ..... W C C        254   220   438    81     7
 7     H O O ..... C H C         C C C ..... H W C          W H C ..... C B C        259   223   423    77     9
 8     W O H ..... H W O         H O C ..... W C C          O W C ..... C B H        257   230   420    83    10
 .....
75     C W C ..... C C C         C C C ..... C H C          B C O ..... B B B        208   231   397    71    93
76     O C H ..... C C O         B B H ..... C W C          B H O ..... B B B        205   238   395    69    93
77     H C W ..... C C O         H B W ..... H C C          B W B ..... B B B        212   233   395    64    96
78     W C C ..... C H O         W B C ..... O C H          B B B ..... B B B        209   229   392    73    97
79     C B C ..... H W H         C B H ..... O O W          B B B ..... B B B        202   235   398    67    98
80     H B C ..... W C W         C B W ..... O H C          B B B ..... B B B        211   227   391    73    98

1 This table is reproduced from Frederic F. Stephan, W. Edwards Deming and Morris H. Hansen. The sampling procedure of the 1940 Population Census, Journal of the American Statistical Association, Vol. 35, 1940, p. 615-630, with the kind permission of the Journal of the American Statistical Association.
The symbols represent the following: H - Head of household; W - Wife of head; C - Child of head; O - Other member of household; B - Line left blank.

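The line pattern in the totals of Table 14 can be made explicit with a few lines of arithmetic. The sketch below is only an illustration: the figures are the "Totals for all sheets" as read from Table 14, and the names `TOTALS`, `SYMBOLS` and `shares` are introduced here for the example and do not come from the paper quoted.

```python
# Totals for all sheets, as read from Table 14: line number -> (H, W, C, O, B).
TOTALS = {
    1: (327, 208, 378, 87, 0),
    2: (231, 305, 384, 80, 0),
    3: (252, 215, 442, 90, 1),
    4: (256, 227, 426, 88, 3),
    5: (249, 229, 431, 85, 6),
    6: (254, 220, 438, 81, 7),
    7: (259, 223, 423, 77, 9),
    8: (257, 230, 420, 83, 10),
    75: (208, 231, 397, 71, 93),
    76: (205, 238, 395, 69, 93),
    77: (212, 233, 395, 64, 96),
    78: (209, 229, 392, 73, 97),
    79: (202, 235, 398, 67, 98),
    80: (211, 227, 391, 73, 98),
}
SYMBOLS = ("H", "W", "C", "O", "B")


def shares(line):
    """Share of each symbol among all entries recorded on one line number."""
    row = TOTALS[line]
    total = sum(row)
    return {symbol: count / total for symbol, count in zip(SYMBOLS, row)}


if __name__ == "__main__":
    # A systematic sample that always starts on line 1 would inherit the
    # share of heads shown for line 1, not the share typical of other lines.
    for line in sorted(TOTALS):
        s = shares(line)
        print(f"line {line:2d}: heads {s['H']:.3f}  blanks {s['B']:.3f}")
```

Read this way, line 1 carries the largest share of heads and no blanks, while the blanks are concentrated on the last lines of the sheet, which is the pattern the text describes.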
It is obvious that in this situation a simple systematic sample of 1 out of 20 lines cannot be taken without running the risk of biases. A method was necessary, therefore, to eliminate the possibility of the biases to which systematic selection is exposed. One way of doing so would be to find a procedure of selecting the sample in which all the lines would enter the sample with equal frequencies. This could be achieved by preparing 20 different styles of questionnaires. On the first of them the sample lines would be 1, 21, 41 and 61; on the second 2, 22, 42, 62, etc., with the last one being 20, 40, 60, 80. On this basis, the actual enumeration would be organized in such a way that the first style of questionnaire is assigned to the first enumeration district, the second style to the second district, etc., carrying on the same distribution of schedules over city or county boundaries. The resulting sample, clearly enough, would not be free from bias for very small administrative units but would certainly become so for large units like states or similar regions.

Practical realization of this idea would not meet serious obstacles as far as the distribution of questionnaires is concerned. The total number of enumeration districts in each administrative unit is known before the enumeration starts, because the delineation of enumeration district boundaries, together with their mapping or detailed description, is an activity which is begun and finished, in any case, before the distribution of questionnaires to the enumerators becomes a problem. Nevertheless, the idea was not used. First of all, it was not found feasible to print 20 different questionnaires. It was also found that the idea could be simplified from the practical point of view. The study of the material available has shown that, if the method described were applied, some questionnaires would give results that would not differ practically from each other. This can be seen in Table 14 if the totals for different symbols are compared. It applies, for instance, to those styles of questionnaires on which the sample lines start with 9, 10, 11 and 12. In other words, there is a possibility of reducing the total number of styles of questionnaires and so making the whole idea more attractive from the practical point of view. In fact, it was found that five styles of questionnaires are enough to get an "unbiased set of lines" into the sample. These styles will be called V, W, X, Y and Z. The styles W, X, Y and Z are used to represent, in the sample, the lines which are supposed to cause biases. These lines are: 1-6, 39, 40, 41, 42, 44, 46, 75, 77, 79 and 80. The remaining lines will be represented by style V. The lines relating to each particular style were finally the following:

Style   Line numbers
V       14  29  55  68
W        1   5  41  75
X        2   6  42  77
Y        3  39  44  79
Z        4  40  46  80

For the last four styles, i.e., W, X, Y and Z, 16 lines were selected. Since each sheet was supposed to have 4 sample lines, these lines were selected for each style by arranging the eligible 16 lines in the order in which they were found on the sheet and taking every fourth line for a particular style.

A word or two may be necessary on the selection of the lines used for these four styles. Previous explanations make clear the existence of the "line bias" in connection with the first and last lines on each side of some of the sheets. With the first lines this is so because they show on the average more heads of families and fewer children than the other lines. Further, they are rarely left blank. With the last lines it is so because they contain on the average fewer heads than the other lines. Since it is not known how many lines at the beginning and at the end of each side of the schedule are affected by this kind of bias, it might appear better to cover, by styles W, X, Y and Z, the first 4 and last 4 lines on each side of the questionnaire, i.e., the lines numbered 1-4, 37-44, and 77-80. The procedure adopted emphasizes the beginning of the questionnaire (lines 1-6), the beginning of the back page (lines 41, 42, 44, 46), and the end (lines 75, 77, 79, 80). At the same time, it puts less stress on the last lines of the front page (by selecting only lines 39 and 40). Such a selection of these lines was suggested by the study of the census material in which the patterns discussed came to light. This study shows, for example, that the line bias at the beginning of the schedule persists even beyond line 6.

Another question may arise regarding the spacing between the lines selected at the beginning and at the end of the back page. One might expect biases to be better controlled by selecting the first 4 lines (41-44) and the last 4 (77-80) than by the selection in fact used (lines 41, 42, 44, 46, and 75, 77, 79, 80). According to the explanation given in the paper quoted, the selection adopted is justified by the need to control biases which might affect lines somewhat further from the first or the last 4 lines. Such lines are, for example, 45, 46, 75, 76. For this reason, lines 43 and 78 were left out and 46 and 75 were included. This additional risk of biased results would otherwise be left uncontrolled.

In selecting the lines for style V, line 14 was used as the first because the study of the census material revealed that it gives a satisfactory average of census characteristics for persons falling after line 6. The serial numbers of the remaining three lines were determined on the basis of the regression of various census characteristics on line number. For the lines which are left to be sampled for style V of the questionnaire, it was found that this regression can be represented by a smooth curve. This does not hold for the lines at the beginning, in the middle and at the end of the questionnaire, because of the existence of the patterns. It does hold, however, for the rest of the lines, on which the varying sizes of households produce an effect similar to a random distribution of persons over these lines. By using a third degree equation to express this regression, it was possible to find the serial numbers of lines which give a practically unbiased set of lines for the population of lines eligible for style V. These three additional lines were 29, 55 and 68. 17

The remaining problem is how to distribute these five styles of questionnaires among enumerators. Style V represents 64 lines of the sheet, while each of the other four styles, W, X, Y and Z, represents only 4 lines. If equal frequency is to be given to each line of the sheet, then in every 20 questionnaires used 16 should be of style V and one each of the remaining four styles.
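The allocation of lines to styles and the resulting 16 : 1 : 1 : 1 : 1 ratio can be retraced with a short sketch. It is only an illustration of the arithmetic described in the text; the names `twenty_styles`, `every_fourth`, `represented` and `quota` are introduced here for the example, and the style V lines are simply copied from the text rather than derived from the regression mentioned there.

```python
LINES_PER_SHEET = 80
SAMPLE_LINES_PER_SHEET = 4

# The 20-style scheme first considered: style k samples lines k, k+20, k+40, k+60.
twenty_styles = {k: list(range(k, LINES_PER_SHEET + 1, 20)) for k in range(1, 21)}

# Lines listed in the text as likely to cause biases.
BIAS_LINES = [1, 2, 3, 4, 5, 6, 39, 40, 41, 42, 44, 46, 75, 77, 79, 80]


def every_fourth(lines, styles=("W", "X", "Y", "Z")):
    """Order the eligible lines as on the sheet and give every fourth one to a style."""
    ordered = sorted(lines)
    return {style: ordered[i::len(styles)] for i, style in enumerate(styles)}


# Style V lines are taken from the text; W to Z follow from the rule above.
five_styles = {"V": [14, 29, 55, 68], **every_fourth(BIAS_LINES)}

# Number of sheet lines that each style stands for in the sample.
represented = {style: SAMPLE_LINES_PER_SHEET for style in "WXYZ"}
represented["V"] = LINES_PER_SHEET - len(BIAS_LINES)

# Questionnaires of each style in every 20, if each line is to have equal weight.
quota = {style: n * 20 // LINES_PER_SHEET for style, n in represented.items()}

if __name__ == "__main__":
    for style in "VWXYZ":
        print(style, five_styles[style], "per 20 questionnaires:", quota[style])
```

Taking every fourth of the 16 ordered lines reproduces the four special styles listed above, and weighting each style by the number of lines it stands for gives the quota of 16 questionnaires of style V to one of each other style.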
This can easily be achieved in practice if each enumeration district as a whole is enumerated with a questionnaire of one particular style, i.e., if an enumerator is given only one type of questionnaire. This will not produce the desired ratios 16 : 1 : 1 : 1 : 1 where enumeration districts vary in size, but this is not considered very important in practice, because the influence of the patterns described is eliminated after some households have been enumerated. At this stage, any style used may be relied upon to give unbiased estimates. It is only important, therefore, that the enumerators start their enumeration on the style assigned. Another factor also contributes to this situation. This is the preparation of sample estimates for relatively large administrative units, which eliminates possible effects of small enumeration districts and, in general, the deviations from the ratios prescribed. For all these reasons, the actual distribution of styles over enumeration districts was made in the following order: V, V, W, V, V, V, V, X; V, V, V, V, Y; V, V, V, V, Z; V, V. This rotation scheme was repeated county after county, state after state, without change.

17 Details of the procedure used to arrive at these serial numbers will be found in the paper quoted.

Another method of checking on the presence of selection biases consists in comparing the results obtained in the sample against certain known results of the complete census. For example, if the enumerators are asked to select every tenth person for the supplementary sample survey, the total size of the sample obtained should be equal to 10 percent of the total number of persons known from the census. In censuses of agriculture, it has been found that enumerators sometimes arrive at a correct size of the sample but change its structure, in the sense that some size classes are underrepresented and others overrepresented. A check on the presence of such biases can also be performed once the size classification of holdings has become known from the tabulation of the results of the complete enumeration census. A comparison of the percentage of the sample falling within each class with the corresponding percentage of units in the same class according to the census shows the character of the changes introduced in the composition of the sample.

Some such procedure was used in checking the presence of selection biases in the 1950 United States Census of Population. In addition to a number of measures taken with a view to reducing the opportunities for enumerators to deviate from the procedure prescribed for the selection of the sample, district supervisors were required to submit a report containing data on the size of the sample selected from the areas under their charge and on the total population obtained by the census as well. Twenty percent of the latter totals were then compared with the size of the sample.
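Both the rotation of styles over enumeration districts and the check of sample size and structure against census totals lend themselves to a short sketch. The rotation is the one given in the text; everything else is illustrative: the function names are introduced here, and the holdings figures by size class are invented solely to show how a sample can have the correct overall size and still a distorted structure.

```python
from collections import Counter
from itertools import cycle, islice

# Rotation of questionnaire styles over successive enumeration districts.
ROTATION = "V V W V V V V X V V V V Y V V V V Z V V".split()


def assign_styles(n_districts):
    """Assign a style to each district by repeating the rotation without a break."""
    return list(islice(cycle(ROTATION), n_districts))


def sampling_rates(census_counts, sample_counts):
    """Realized sampling fraction in each class, given census and sample counts."""
    return {cls: sample_counts[cls] / census_counts[cls] for cls in census_counts}


# Hypothetical holdings by size class (census) and in a nominal 10 percent sample.
census = {"small": 4000, "medium": 5000, "large": 1000}
sample = {"small": 380, "medium": 500, "large": 120}

overall_rate = sum(sample.values()) / sum(census.values())
rates = sampling_rates(census, sample)

if __name__ == "__main__":
    print("styles in one rotation:", dict(Counter(ROTATION)))
    print("styles over 94 districts:", dict(Counter(assign_styles(94))))
    print("overall sampling rate:", overall_rate)
    print("rates by size class:", rates)
```

In the invented figures the overall rate is exactly 10 percent, yet small holdings are underrepresented and large ones overrepresented, which is the kind of change in structure that only the comparison by classes brings out.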
Results of a similar comparison of the size of the sample with the corresponding fraction of the population are given in Table 15. This table contains the frequency distribution of percentages representing the size of the sample with respect to the size of the population known from the census. In preparing this table, large cities were defined as those which had a population of 50,000 and over in the 1940 Census of Population. Such cities are excluded in this study from county data. The total number of counties and cities is 3,299, with 2,274 (or 68.9 percent) having samples of less than 20 percent and 1,025 (or 31.1 percent) having samples of 20 percent and over.

TABLE 15. - FREQUENCY DISTRIBUTION OF THE SIZE OF COUNTY SAMPLES 1
(20 percent sample, 1950 United States Census of Population)

Class          Number of counties and cities   Percentage of total
15.00-17.99                   2                        0.1
18.00-18.99                   7                        0.2
19.00-19.49                  39                        1.2
19.50-19.59                  37                        1.1
19.60-19.69                  67                        2.0
19.70-19.79                 210                        6.4
19.80-19.89                 617                       18.7
19.90-19.94                 618                       18.7
19.95-19.99                 677                       20.5
20.00-20.04                 473                       14.3
20.05-20.09                 261                        7.9
20.10-20.19                 207                        6.3
20.20-20.29                  55                        1.7
20.30-20.39                  20                        0.6
20.40-20.49                   3                        0.1
20.50-20.99                   6                        0.2

1 This and the following table have been reproduced from J. Steinberg and J. Waksberg. Sampling in the 1950 Census of Population and Housing, Washington, D.C., U.S. Bureau of the Census, 1956, p. 23.

The significance of these deviations from 20 percent of the corresponding population is presented in Table 16.

TABLE 16. - DEVIATIONS OF COUNTY AND CITY SAMPLES FROM 20 PERCENT OF THE CORRESPONDING POPULATION EXPRESSED IN STANDARD ERRORS
(20 percent sample, 1950 United States Census of Population)

Size of deviation          Number of counties and cities   Percentage of total
Under 1 standard error               2 081                        63.1
Under 2 standard errors              2 894                        87.7
Under 3 standard errors              3 150                        95.5
Under 4 standard errors              3 239                        98.2
Total                                3 299                       100.0

It is seen that in some counties and cities the size of the sample differs significantly from 20 percent. In other words, a slight bias is present in the results of this sample. For other characteristics, the bias is somewhat larger than for the total number of persons. The source of the bias was found to be the failure of enumerators to follow exactly the instructions given.
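One way of reading Table 16 is to set its cumulative percentages beside those that would be expected if the deviations arose from chance alone. The sketch below does this under an assumption that is not stated in the source, namely that chance deviations expressed in standard errors would be approximately normally distributed; the counts are those of Table 16, and the names used are introduced only for the example.

```python
import math

TOTAL = 3299
# Counties and cities whose deviation is under k standard errors (Table 16).
CUMULATIVE = {1: 2081, 2: 2894, 3: 3150, 4: 3239}


def normal_within(k):
    """Probability that a standard normal variable lies within k standard errors."""
    return math.erf(k / math.sqrt(2))


observed = {k: 100 * count / TOTAL for k, count in CUMULATIVE.items()}
expected = {k: 100 * normal_within(k) for k in CUMULATIVE}
beyond_four = TOTAL - CUMULATIVE[4]

if __name__ == "__main__":
    for k in CUMULATIVE:
        print(f"under {k} s.e.: observed {observed[k]:.1f}%  "
              f"normal assumption {expected[k]:.1f}%")
    print("units beyond 4 standard errors:", beyond_four)
```

Under this assumption the observed share falls short of the expected one at every level, and a number of units lie beyond 4 standard errors where practically none would be expected, which agrees with the conclusion that some samples differ significantly from 20 percent.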
These are some typical studies of the presence of selection biases. The problem will be taken up later in the chapter "Adjusting sample results," where some methods for removing these biases will be discussed.

Choice of technique for selection of the sample

The results discussed show a number of ways in which selection bias can appear in an automatically selected sample. This is an important factor bearing on the decision as to what method should be used for selection in a particular case. From this point of view, the use of a special procedure for selecting the sample is safer, although no selection procedure is free, a priori, from biases. The files of holdings may not be complete and mistakes may occur during the preparation of lists for the selection of the sample. If the sample is selected through a special procedure, however, the work involved is done by a relatively small group of people specially trained for the purpose and this reduces the possibility of biased samples. In the 1945 Census of Agriculture in the United States, the use of the Master Sample meant automatically an unbiased sample owing to the procedure adopted in its selection.

In addition to the bias problem there are a large number of factors which will influence our decision. One of the first facts to be considered in this respect is the composition of the supplementary sample survey program and the type of items included in it. From this point of view, the number of these items may be important. If there are only a few of them and the complete enumeration program is not very large, regular census enumerators can be expected to collect sufficiently accurate information on the sample survey program. A long list of questions, on the other hand, may cause fatigue in both enumerator and respondent. Since the complete enumeration census is usually taken first, the consequence of such fatigue may be poor quality of the data collected for the supplementary program. Another all too well-known consequence of lengthy questioning is a gradual loss of interest on the part of both enumerator and respondent.
If the sample survey program is long and cannot be shortened in any way, it might be reasonable to consider the possibility of employing special enumerators who are themselves interested in the survey and have the necessary skill to keep the respondent's interest alive until all the questions on the program are exhausted.

Another problem is the type of information included in the sample survey program. If we take into consideration the sample census program of the 1940 Census of Population, we shall see that it involves items on which any respondent can easily give accurate answers at once, without making special efforts or losing time by going through his records. Such questions are: What language did this person speak in his home in early childhood? Is the person a veteran? The wife, widow or under-18-year-old child of a veteran? Has this person been married more than once?, etc. The facts asked for in these questions are basic for any individual, so that it can be taken for granted that accurate answers are known to everybody. In the case of such items on the sample survey program, there is again no obstacle in principle to automatic selection and the use of regular census enumerators. No particular requirements as to ability and training are here imposed on enumerators.

In this respect the situation is very different with the sample programs mentioned earlier in connection with censuses of agriculture. The sample survey programs of the 1950 Censuses of Agriculture in Finland and Japan are not only large in terms of the number of items included; they may also be called difficult from the point of view of the possibility of getting accurate answers for even some of their items. In the Japanese supplementary sample survey, in the group of items on labor force, a special question was asked to indicate the persons engaged in agricultural work on a farm among all those who lived on the farm. This may give rise to the query as to what kind of work is considered agricultural and to what extent a person needs to be engaged in a given type of work to belong to the class of people of interest to the census. The same holds true for the questions on the number of persons engaged in activities other than agricultural. Here again the problem arises as to how long they have to work for another branch of industry to be included in that group.
The number of similar difficulties regarding questions on people living on farms is not small and therefore enumerators need to be very well informed on how to proceed in a particular situation. In this census, there was a list of more than 100 crops, for each of which data were required on the area harvested, total yield, quantity sold and value obtained. A good part of the difficulties known so far in eliciting accurate answers can very easily be linked with these questions. It is impossible to expect all holders to keep records on so many details regarding the activities of their holdings. Even if they do, would it be feasible to make use of such records? On the more realistic assumption, i.e., if these records do not exist, the enumerators must exercise infinite patience in checking the consistency of the data obtained. They also have to be very familiar with agriculture in general and the local conditions of farming in particular, to be able to discard some answers and ask for additional explanations.

It appears then that specially trained enumerators are needed where there is a long list of items on the sample survey program. Such a program usually refers, at least in part, to facts which are not necessarily known to the individuals concerned. Many of these items are difficult questions, in the sense described above. The total number of children born to a woman, for example, and the area harvested, the total yield, the quantity sold and the price obtained for pearl millet by an agricultural holding in the past year illustrate facts which do not rank equal in a person's memory. The first is very clear, while the second is vague and, in many cases, confused with a number of other similar facts. In the first case, many enumerators can be assumed sufficiently capable of obtaining an accurate answer. In the second, the enumerator has to be very well trained, patient, familiar with the field of work and possessed of special qualities and abilities. In such a situation, it might be difficult to expect that a large number of enumerators could be successfully used for providing information on a sample survey program. In other words, if the supplementary program is composed of difficult questions, automatic selection of the sample may not be appropriate.

Another important fact in connection with our problem is obviously the organization of the statistical services.
In countries where a permanent field staff is engaged, and particularly where relatively small administrative units are assigned their own field workers who are residents in these units, there are, generally speaking, better opportunities for using specially trained enumerators than in countries with a highly centralized statistical service without permanent field staff. The presence of the staff in the field eliminates one of the basic obstacles to the use of the special procedure in selecting the sample, that is, the high cost of travel and allowances.

The last point to be considered in dealing with the choice between these two procedures is the connection with the use of sampling methods for the purpose of broadening the scope of the tabulation programs. This is an extremely important point that has to be examined separately before any decision is taken with respect to the use of the procedures discussed here. Details on this subject will be found in Chapter 7.

6. THE USE OF SAMPLING METHODS IN TABULATION 1

Advance estimates

It has already been pointed out that a broad interest in statistical data is characteristic of our times. Mention has also been made that, as a result of such interest, the need has emerged for data that are timely and that embrace a much wider scope than has been the case heretofore. The problem of scope has already been discussed. The question of timeliness is connected with practice. Apart from their historical and purely scientific aspects, census data are closely linked with practical activity. If published after a long delay, their practical value is naturally reduced. This in itself constitutes a new problem in census methodology, and an attempt to find a solution to it can be seen in the recent censuses of many countries.

Some decades ago, the problem of speed did not exist in the form we are acquainted with today. The preparation of timetables in planning the tabulation of censuses was pursued largely independently of practical needs. As an extreme example, we may mention the 1921 Census of Population in Yugoslavia, which was tabulated by hand, in such a way that the tabulation was not completed by 1931 when the new census was taken. This census was again tabulated manually and the process was still incomplete in 1941 when war broke out and the entire census material was destroyed. 2

1 The use of sampling methods for purposes of tabulation started relatively early. Some work along these lines was done by Kiaer, then by Russian statisticians, in Italy, Bulgaria, etc. The reader who is interested in this early work might wish to consult: O. Anderson. Ueber die repräsentative Methode und deren Anwendung auf die Aufbereitung der Ergebnisse der bulgarischen landwirtschaftlichen Betriebszaehlung vom 31. Dezember 1926, Munich, Deutsche Statistische Gesellschaft, 1949; C. Gini and L. Galvani. Di una applicazione del metodo rappresentativo all'ultimo censimento italiano della popolazione (1 dicembre 1921), Annali di Statistica, Serie 4, Vol. 4, 1929; You Poh Seng. Historical survey of the development of sampling theories and practice, Journal of the Royal Statistical Society, Vol. 114, 1951, p. 163-195; S.S. Zarkovich. Note on the history of sampling methods in Russia, Journal of the Royal Statistical Society, Vol. 119, 1956, p. 336-338. These early developments are not taken into account in this book. The use of sampling methods in tabulation is presented in the light of recent needs and practices.

2 S.S. Zarkovich. Sampling methods in the Yugoslav 1953 Census of Population, Journal of the American Statistical Association, Vol. 50, 1955, p. 720-737.

In censuses of population these urgent needs are partly satisfied by the preliminary reports on the census count. When the enumeration is completed, the enumerators prepare summaries on their respective districts and declare the total number of people enumerated and certain other basic facts. In this way, the most important census results are known by small administrative units very shortly after enumeration.

The basic result of a preliminary report is the number of units enumerated. This is probably the most important information in censuses of population. It is not so in censuses of agriculture. A preliminary report containing only the number of holdings would not be of much use. In agricultural censuses other items are at least equally important, such as total area of holdings, areas under different crops, production of these crops, and number of livestock. The preliminary report cannot cover all these items and, if data are needed urgently, the only solution may be to prepare advance estimates of basic census results.

The advantages of using sampling methods in the preparation of advance estimates are obvious. In addition to achieving savings, this is the way of putting advance estimates on a scientific and objective basis.

The preparation of advance estimates is primarily a problem for population censuses. An explanation of this may be the wider interest in the results of population censuses than in the data of other censuses and the need for faster action in the field of population than in others.

Recent experience also suggests that the use of standard modern machinery for mechanical tabulation cannot satisfy urgent needs for census data. It does, in fact, make the tabulation speedier but, at the same time, gives rise to an increasing number of requirements. The introduction of more efficient machines has thus been followed by larger and more diversified programs. During the preparations for the 1951 Census of Population in Great Britain it was found that the use of modern machinery could not shorten the delay in the publication of census data, estimated at four years from the initial enumeration. 3 If in a situation of this kind something has to be done to satisfy urgent needs, it would only be possible by reducing the bulk of the census material. In other words, the only way was to use sampling methods to prepare the advance estimates.

3 United Kingdom. General Register Office. Census 1951, Great Britain one per cent sample tables, London, H.M.S.O., 1952.

Advance estimates, however, will one day become unnecessary if present processing techniques are made more efficient. The progress achieved so far in the use of electronic computers and a number of other devices in data processing is very promising 4 and it may easily be that, in the near future, the completion of processing of a census in a large country will be fast enough to satisfy even urgent needs. In this case, less interest in the advance estimates may be expected, always provided the definition of "urgent needs" remains the same as it is today.

Advance estimates prepared in a number of recent censuses vary considerably as to their scope and the type of purposes they are designed to serve. The program of work for their preparation obviously depends upon a wide range of needs and the number of details required. Then comes the question of the time available for tabulation, the type of facilities, the variability of characteristics on the program, the precision aimed at, the size of administrative units by which data are expected to be presented, and access to the census material.

An example of a situation at one extreme, where the speed factor was of primary importance and where, in consequence, the advance estimates were reduced to a few items, is to be found in the 1949 Sample Census of Population in Poland, which was taken for the specific purpose of meeting certain urgent needs. On this occasion, there was also a particular interest in the age classification of the population. So a subsample was selected from the initial sample and the desired classification prepared. 5 At the other extreme are the advance estimates prepared for a larger number of items, where more time is available for tabulation. The illustrations presented in the next section will give some idea of this type of more systematically prepared advance estimates and some typical problems met with in this work.

Illustrations

Our first example refers to the advance estimates of the 1950 Census of Population in Japan. 6 The preparation of advance estimates was divided into two parts. The most urgent needs were assumed to be met 4 Cf Joseph F. Daly and Morris H. Hansen. Data processing on electronic computers in the United States Bureau of the Census, Bulletin of the International Sta tistical Institute, Vo1. 36, 1958, Pt. 4, p. 376-387. 5 Stefan Szulc. The sample census of population in Poland, 1949, Population Studies, Vol. 4, 1950, p. 112-114. e This presentation of the use of sampling methods for preparing the advance esti mates in the 1950 Census of Population in Japan is based on Yuzo Morita. Sam pling tabulation of the 1950 Population Census in Japan, Bulletin of the International Statistical Institute, Vol. 33, 1951, Pt. 4, p. 47-54. 7
90
SAMPLING METHODS AND CENSUSES
by the so-called first advance estimates, which were concentrated on a short program of tabulation. The second advance estimates were prepared for a broader tabulation program.

For the purposes of this census, Japan as a whole was divided into about 370,000 enumeration districts, of which three different types were distinguished: (a) special enumeration districts (SEDs), which contained (i) five or fewer households, with an average size of 20 persons or more; and (ii) one or more households, with 100 persons or more; (b) divisible enumeration districts (DEDs), or those which had more than 500 persons; and (c) ordinary enumeration districts (OEDs), or all others. These were designed in such a way as to include an average of 50 households.

The enumeration was carried out between 1 and 3 October. In principle, each district was assigned a separate enumerator. When the operation was over, the editing of the questionnaires was performed in the field and the entire material sent to the Statistics Bureau. On the basis of the enumerators' reports, the preliminary count was released on 28 December, placing the total population at 83,196,000. The second count was released on 28 February 1951, with a total of 83,199,637 persons.

In previous population censuses of Japan, five to eight years were usually needed to tabulate census data so that the information required would be available to smaller administrative units. To avoid such a delay in this instance, and because up-to-date population data were important owing to the effects of war, it was decided to tabulate the census according to the following plan.

1. First advance estimates. To obtain basic information for the country as a whole on characteristics such as age, sex, marital status, labor force status, school attendance, housing. For preparing these estimates a 1 percent sample was used. The tabulation was finished by the end of April 1951 and results were released in May and June.

2. Second advance estimates. These aimed at providing the same information as before, but for prefectures and large cities. In addition, a number of tables were introduced into the plan for this tabulation (since it was very broad and included approximately 70 tables) that did not constitute part of the plan for the complete tabulation. Here we have an example of the use of sampling methods in tabulation which goes beyond the level of information required to cover urgent needs. For the preparation of these tables a 10 percent
THE USE OF SAMPLING METHODS IN TABULATION
sample was used. The task was completed by the end of September 1951 and the results were planned for release in the spring of 1952.

3. The complete tabulation, in connection with which the mechanical work was expected to be finished by March 1953.

The sampling unit used was the enumeration district. Individual persons would have been more efficient units where variance was concerned, but access to data for individual persons was an extremely complicated affair, while access to the enumeration districts was much easier. In such a case, if the enumeration districts are used as sampling units, the organization of processing and tabulation can easily be arranged in such a way that there is no duplication of work or loss of time: processing of sample enumeration districts can start first, and when the tabulation is finished the cards used are simply placed in their normal order. In addition, in the Japanese census that we are using as an example, the enumeration districts were designed to have an average of 50 households, with a variation between 30 and 70 households. That is, the variation between the enumeration districts was kept to a moderate level, an important consideration in the quest for precise estimates for many census characteristics.

Taking the enumeration districts as sampling units thus offered several advantages. Among these we would draw attention to the possibility of working on sample estimates without at the same time disturbing the normal flow of the census material. Obviously, these advantages refer in their fuller sense to the ordinary enumeration districts. The other districts had to be modified to suit the range of variation indicated above. Consequently, the divisible enumeration districts were subdivided according to the criterion presented in Table 17.
The figures on the left indicate the sizes of enumeration districts which have to be transformed, while the corresponding figures on the right show the number of OEDs to be created.

TABLE 17. - CRITERION FOR DIVISION OF ENUMERATION DISTRICTS
(1950 Census of Population, Japan)

Initial size of enumeration districts (persons)     To be transformed into (number of OEDs)
500-749                                              2
750-999                                              3
1 000-1 249                                          4
for each additional 250 persons                      add one more OED
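The division criterion of Table 17 can be written as a one-line rule. The following is only a sketch; the function name `oeds_from_ded` is hypothetical, and the rule is read off the table on the assumption that the pattern of one more OED per additional 250 persons continues without limit.

```python
def oeds_from_ded(persons: int) -> int:
    """Number of OEDs into which a divisible district is split (Table 17)."""
    if persons < 500:
        raise ValueError("Table 17 covers only districts of 500 persons or more")
    # 500-749 -> 2, 750-999 -> 3, 1 000-1 249 -> 4, then one more
    # OED for each additional 250 persons: integer division by 250.
    return persons // 250


for size in (500, 749, 750, 1249, 1250):
    print(size, oeds_from_ded(size))
```

A district of 1,250 persons would thus be split into five OEDs of about 250 persons each.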
In the next stage of the work all the enumeration districts available were numbered serially throughout the country and each consecutive 100 districts were made to form one pack. The sample of 10 percent was selected in such a way that ten enumeration districts were selected at random from each pack, and the first of each of these selected districts was used to form the 1 percent sample. This explains how the sample was drawn from the population of ordinary enumeration districts and the divisible ones transformed into OEDs.

The same procedure was not possible with SEDs and for this reason the sample of individuals was drawn from all the special enumeration districts. For a proper understanding of the procedure, it should be added that the enumeration in this census was performed on an interview basis. The enumerators contacted heads of households and entered data obtained for all the members on sheets with 60 lines each. To get a 10 percent sample of persons from each sheet, 6 lines were selected at random with the restriction that each consecutive 10 lines had to be represented by 1 line selected for the sample. In order to avoid biases during the selection procedure, which were likely to appear if the selection were left to inadequately supervised administrative personnel, a special table was prepared indicating which lines were selected (in terms of their serial numbers). For the 1 percent sample, every tenth of the selected lines was taken into account.

As far as the estimation procedure was concerned, in the case of the 1 percent sample, the simple unbiased estimate was applied, i.e., the sample totals were multiplied by 100. In the case of the 10 percent sample the ratio method of estimation was used. During the preparation of these estimates, population totals were known for cities (shi) and rural areas (gun). Since the time was available, these totals were also used to adjust sample estimates to agree with the final count.
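The selection steps just described can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure actually used in the census office: the function names are hypothetical, a pseudo-random generator stands in for the prepared selection table, and an incomplete last pack is simply sampled at the same rate.

```python
import random


def select_district_samples(n_districts, pack_size=100, per_pack=10, seed=None):
    """Return (ten_percent, one_percent) lists of district serial numbers."""
    rng = random.Random(seed)
    ten_percent, one_percent = [], []
    for first in range(1, n_districts + 1, pack_size):
        last = min(first + pack_size - 1, n_districts)
        pack = range(first, last + 1)            # one pack of consecutive districts
        chosen = rng.sample(pack, min(per_pack, len(pack)))
        ten_percent.extend(chosen)
        one_percent.append(chosen[0])            # first district drawn in the pack
    return ten_percent, one_percent


def select_sheet_lines(lines_per_sheet=60, block=10, seed=None):
    """Pick one line at random from each consecutive block of 10 lines."""
    rng = random.Random(seed)
    return [rng.randint(start, start + block - 1)
            for start in range(1, lines_per_sheet + 1, block)]


def simple_unbiased_total(sample_total, multiplier=100):
    """Simple unbiased estimate for the 1 percent sample: sample total x 100."""
    return sample_total * multiplier
```

With 1,000 districts, `select_district_samples(1000)` returns 100 districts for the 10 percent sample and 10 of them for the 1 percent sample, one from every pack, which is what gives the design its geographical spread.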
The adjustment of sample estimates to the known totals was not made on the first advance estimates because there the primary concern was speed. On the other hand, in the case of the 10 percent sample it was found to be very important to adjust sample estimates because (a) they are considered as final; and (b) some of the tables produced here remain the only data available from the census. Any difference between the known census totals and sample estimates might produce confusion.

In estimating the variances of the results obtained, only the variation between the ordinary enumeration districts (including transformed DEDs) was taken into account. The SEDs were disregarded because they represented only 1 percent of the total population. Besides, the system of sampling used for SEDs, where persons were selected instead of enumeration districts, was more efficient than in the former case, so that the procedure used did not lead to an underestimation of the total variance. The between-OED variance was, in fact, computed on the basis of 90 groups of four adjacent districts selected systematically among those which constituted the 1 percent sample. In other words, a total of 360 enumeration districts was used to estimate the between-OED variance for the characteristics used in the tabulation program. In the 10 percent sample estimates it was important to show sampling errors by shi and gun. For this purpose, the selected 360 enumeration districts were classified according to whether they belonged to shi or gun; 120 were found in the former and 240 in the latter. As before, they were used to estimate the between-OED variances within shi and gun respectively.

To have some idea of the magnitude of sampling errors obtained in the first advance estimates, Table 18 is reproduced here as part of a similar table in the paper quoted. It can be seen that the three types of errors are very much the same. On the other hand, owing to the fact that the estimates concerned were obtained from a sample of 1 percent of enumeration districts, the magnitude of the standard errors can be considered as satisfactory for most of the practical purposes that the first advance estimates have to serve. The corresponding sizes of standard errors for a 10 percent sample are obtained by dividing the figures in this table by √10.

In connection with these two samples, attention should be drawn to the benefits obtained by the system adopted for selecting the sample. The OEDs established after the transformation of DEDs were serially numbered throughout the country and then the sample was taken out of each pack of 100 adjacent districts.
Selection from packs of adjacent districts works as if a high degree of area stratification had been performed before the selection. It is not known what gains were realized by such a procedure. For some characteristics, such as labor force status, school attendance, housing, etc., the type of selection used could be extremely important. In similar situations, i.e., when estimates are prepared by small areas, the advantages of a system of selecting sample units that is based on a high degree of stratification, as it is in this case, may be considered indispensable.

TABLE 18. - STANDARD ERRORS BY SIZE OF ESTIMATES IN THE 1 PERCENT SAMPLE
(1950 Census of Population, Japan)

Size of estimates          Standard errors (percentage)
(thousand)             All Japan     All shi     All gun
80 000                    0.8           -           -
60 000                    0.8           -           -
40 000                    1.0           -          0.8
20 000                    1.4          1.1         1.2
10 000                    1.5          1.4         1.3
 5 000                    1.6          1.6         1.3
 2 000                    1.7          1.6         1.7
 1 000                    2.0          2.0         2.0
   500                    3.0          3.0         3.0
   200                    4.0          4.0         4.0
   100                    5.0          5.0         6.0
    50                    7.0          7.0         8.0
    20                   10.0         10.0        10.0
    10                   15.0         13.0        15.0

A similar type of advance estimates can be found in the 1953 Census of Population in Yugoslavia. In the postwar period many far-reaching changes occurred in this country, such as considerable shifts of population from rural areas to cities, and a drastic alteration in the socio-economic characteristics of the population. The census of 1948 was taken to get a picture of the postwar population situation but the census of 1953 was also necessary because of the considerable changes known to have occurred since the previous census. The census was taken as of 1 April 1953 and advance estimates were needed in order to get an initial idea of the magnitude of changes in literacy (as the result of educational programs throughout the country to eliminate illiteracy), degree of education, economically active population, and so on.

To facilitate this census, the country was divided into 118,999 enumeration districts, 89,194 rural and 29,805 urban. The average size was 142 persons overall, and 136 and 162 persons in rural and urban districts respectively. The enumeration districts were highly equalized in their size because each was assigned a separate enumerator and it was desired to keep enumerators engaged in an approximately equal amount of work. In this situation a systematically selected sample of 100 enumeration districts in urban areas and of 149 districts in rural areas was used for purposes of checking the completeness of enumeration and the response errors in answers to census questions. The same sample was used to get the first advance estimates. The field inspectors who were employed for this work were instructed to prepare the totals for various characteristics in their respective enumeration districts. On the basis of these totals, the estimates were drawn up by using the ratio method of estimation.
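The ratio method mentioned here can be illustrated briefly. The sketch below assumes that the auxiliary variable is the enumerated population of each sample district and that its total is known for the whole country; the text does not say which auxiliary variable was used for the first advance estimates, and the function name and all figures are hypothetical.

```python
def ratio_estimate(sample_y_totals, sample_x_totals, known_x_total):
    """Ratio estimate of a population total: X * (sum of y / sum of x).

    sample_y_totals: characteristic totals in the sample districts
    sample_x_totals: auxiliary totals (e.g. persons) in the same districts
    known_x_total:   auxiliary total known for the whole population
    """
    return known_x_total * sum(sample_y_totals) / sum(sample_x_totals)


# Hypothetical figures: literate persons and all persons in three districts.
literate = [96, 110, 81]
persons = [140, 150, 130]
print(ratio_estimate(literate, persons, known_x_total=1_000_000))
```

The estimate depends on the sample only through the ratio of the two sample totals, which is why it benefits from a known population count.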
The first estimates were established to give a rapid assessment of the approximate magnitude of those census characteristics which were considered subject to sharp changes. To obtain more precise estimates of the same characteristics, and a number of other estimates and cross-classifications, the second advance estimates were also prepared. The design for this work had to respect two principles: (i) the disturbance of the regular processing of census data caused by embarking upon sample tabulation was to be reduced to a minimum; and (ii) each of the six republics was to have estimates of approximately equal precision.

In connection with the first principle, the study of relevant facts showed that enumeration districts are the most convenient first-stage units, and households the most convenient second-stage units. It was found here, as in Japan, that the use of enumeration districts as the first-stage units makes possible an independent processing of the sample. In addition, they were easily accessible: all the enumeration districts were serially numbered and filed in the order corresponding to the serial numbers, so that there was no difficulty in locating any of them. On the other hand, the households were easily accessible within the enumeration districts, which cannot be said for persons, because their questionnaires (each person was enumerated on a separate schedule) were kept together within the household questionnaire. It was found more efficient to include in the sample the household as a whole (all the persons belonging to the household selected) in the second-stage selection than to use persons and embark upon the costly complications connected with their selection.

The second principle made it necessary to vary the sampling fraction of the first-stage selection by republics. The second-stage fraction was fixed at 10 percent of households in all the selected districts.⁷ Table 19 contains data on the population and the sample by republics.
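The overall sampling fraction in such a two-stage design is the product of the two stage fractions. The sketch below uses the first-stage percentages of enumeration districts given in Table 19 and the fixed second-stage fraction of 10 percent; the variable and function names are hypothetical.

```python
# First-stage sampling percentages of enumeration districts (Table 19).
FIRST_STAGE_PERCENT = {
    "Serbia proper": 14.3,
    "Voivodina": 25.0,
    "Croatia": 14.3,
    "Slovenia": 25.0,
    "Bosnia and Herzegovina": 20.0,
    "Macedonia": 33.1,
    "Montenegro": 66.6,
}
SECOND_STAGE_FRACTION = 0.10  # 10 percent of households in selected districts


def overall_percent(first_stage_percent: float) -> float:
    """Overall percentage of households selected in a two-stage design."""
    return first_stage_percent * SECOND_STAGE_FRACTION


for republic, pct in FIRST_STAGE_PERCENT.items():
    print(f"{republic}: {overall_percent(pct):.2f} percent")
```

The products (about 1.4 percent for Serbia proper and Croatia, up to about 6.7 percent for Montenegro) are close to the household and person sample percentages shown in Table 19, and show how the first-stage fraction was raised in the smaller republics to obtain samples of comparable size.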
As has been explained, all the estimates prepared at this stage were given by republics. The first volume, containing 63 tables, was published in August 1954. The second volume,⁸ containing 60 tables, was published in October 1954, and the third,⁹ with 30 tables, appeared in November 1954. Many of the tables published here will not, in fact, be included

⁷ Data used for this presentation are taken from the introduction prepared by M. Blejec for the publication Ekonomska obelezja stanovnistva [Economic characteristics of the population], Belgrade, Federal Statistical Office, 1954. Statistical Bulletin No. 28.
⁸ Vitalna, prosvetna i socijalna obelezja stanovnistva [Vital, educational, and social characteristics of the population], Belgrade, Federal Statistical Office, 1954. Statistical Bulletin No. 29.
⁹ Domacinstva [Households], Belgrade, Federal Statistical Office, 1954. Statistical Bulletin No. 30.
TABLE 19. - SIZES OF SAMPLE OF DIFFERENT UNITS BY REPUBLICS
(Second advance estimates, 1953 Census of Population, Yugoslavia)

                            Enumeration districts             Households                       Persons
                                         Sample                          Sample                           Sample
Republic                 Population  Number  Percent   Population  Number  Percent   Population   Number  Percent
Serbia                       44 654   9 387     21.0    1 638 989  32 715      2.0    6 983 544  146 563      2.1
  Serbia proper              28 864   4 124     14.3    1 019 678  14 209      1.4    4 460 405   62 879      1.4
  Voivodina                  10 509   2 626     25.0      429 088  12 197      2.5    1 713 905   42 714      2.5
  Kosovo and Metohija         5 281   2 637     50.0      127 223   6 309      5.0      809 234   40 970      5.1
Croatia                      28 560   4 124     14.3    1 030 324  14 556      1.4    3 913 753   55 953      1.4
Slovenia                     14 100   3 525     25.0      403 194  10 096      2.5    1 462 961   37 017      2.5
Bosnia and Herzegovina       21 260   4 252     20.0      572 456  11 080      2.0    2 845 486   55 743      2.0
Macedonia                     7 588   2 515     33.1      248 730   8 177      3.3    1 303 906   43 459      3.3
Montenegro                    3 668   2 446     66.6       93 297   6 121      6.6      419 625   28 003      6.7
Total                     ¹ 119 830  26 205     21.9    3 986 990  82 745      2.1   16 972 275  366 738      2.2

¹ The total number of enumeration districts as presented here is slightly different from figures mentioned before in connection with the preparation of first advance estimates. The difference is due to divisions of enumeration districts performed during the preparation of the second advance estimates for the purpose of improving the efficiency of the method used.
TABLE 20. - PERCENTAGE ERRORS FOR SELECTED CENSUS ITEMS
(Second advance estimates, 1953 Census of Population, Yugoslavia)

Columns: Yugoslavia, Serbia, Croatia, Slovenia, Bosnia and Herzegovina, Macedonia, Montenegro

Items:
Age: 0-14
Age: 63 and over
Reads and writes
Number of agricultural holdings of the size:
  0.01-1.00 hectare
  10-15 hectares
Number of households with the head employed in:
  agriculture
  commerce

Percentage
0.9 0.6 1.9 2.4 0.3 0.8
0.3 0.7 0.3
0.6 1.1
0.6
0.9 1.7 0.5
2.3
1.1
1.7 3.3
2.9 7.7
4.5 4.4
1.2 3.1
0.8 4.8
4.7 4.7
1.8 15.7
0.6 0.2 0.2
0.6 1.5 0.6
2.8 6.0
2.7 12.4
2.1 8.2
1.4 8.7
1.3 7.6
1.1 6.5
in the complete tabulation. Since the total population was known, the ratio method of estimation was used. The standard errors of the estimates prepared vary according to characteristics and cell frequencies. A general idea of the magnitude of sampling errors in the estimates prepared can be obtained from Table 20, in which percentage errors are presented for selected items.

The practice of preparing advance estimates has also become very popular in recent censuses in the Federal Republic of Germany. In the 1946 Census of Population, two different samples were selected to prepare two different series of advance estimates. The first sample consisted of 1 percent of artificially formed clusters of about 100 persons each. The selection of clusters was performed according to a systematic pattern within each district in order to ensure the spread of the sample. Equalization of clusters was carried out for the purpose of improving the precision. Within the clusters selected there was no sub-sampling. The estimates obtained were available six months after the census day.¹⁰ This sample had to meet the urgent exigencies resulting from war changes. The need for more details and more precise data than previously was met by the second advance estimates, which were made available three months after the release of the first estimates. The sample used for their preparation consisted of 1 percent of persons selected systematically from the punch cards. The cards selected on the sorter were duplicated, so that the processing of the sample was made independent of the complete tabulation.

In the 1950 Census of Population, there were also two samples. The first consisted of 1 percent of households selected systematically, with all the persons belonging to them included in the sample. The tabulation of the results was ready four months after the census day. The second sample consisted of 1 percent of punch cards, also selected systematically and duplicated, so that it was possible to work on the tabulation of advance results independently of the main processing. This sample was also used for various studies of the applicability of sampling methods in the field of population characteristics and may be referred to as the source of additional information if the need arises in the future for any tabulations outside the adopted program.

The next illustration comes from the 1951 Census of Population in Great Britain, where advance estimates were planned in an interesting way from the point of view of methodology. It should also be noted that this case represents an illustration of advance estimates in the proper sense of the word, combined with the use of sampling methods for purposes of broadening the scope of the tabulation program.

In the introduction to the publication which contains these advance estimates and a description of the methodology of their preparation,¹¹ the needs for early census results in Great Britain were described as follows. The type of tabulation being planned in this census required a period of three or four years after the enumeration before the census data could be made available for public use, that is, approximately the same time as in earlier censuses. In other words, "the position was not materially altered by the introduction in 1911 of high-speed tabulating machinery; its power and flexibility merely stimulated the demand for more detailed and more complex cross-tabulations and, without in any way diminishing the laborious time-consuming operations involved in the examination of the many millions of census schedules, and the conversion of descriptive matter into numerical codes, it introduced the additional hand process of transferring the written records to separate punched cards." At the same time, the needs for census data were found to be "more urgent than ever before," thus making the advance estimates indispensable.

When it was decided to include the preparation of advance estimates in the program, the question of length of time arose. A decision was reached to allow a period of one year for the purpose. To keep to this deadline (in advance of which the same personnel had also to complete many other tasks connected with the census), the choice lay between a 1 percent and a 10 percent sample of questionnaires as a basis for the advance estimates. These two sample sizes are particularly convenient for estimating totals because the sample totals have only to be multiplied by 100 or 10 respectively. Owing to the fact that in the 1951 census 48.8 million persons were enumerated, it was found that a 10 percent sample would be too large. This was particularly so because the preparation of advance estimates was the first work done by the personnel engaged for the purpose and low efficiency was therefore to be expected. On the other hand, the sample of 1 percent was considered too small for a number of estimates and cross-classifications requested, but it was accepted because this was approximately the size that could have been handled within the allowed period of one year.

¹⁰ Data on these estimates are taken from F. Hage. Die repräsentative Auszählung zur Volkszählung 1950, Wiesbaden, Statistisches Bundesamt, 1953; Hans Kellerer. Das Stichprobenverfahren, insbesondere in der amtlichen Statistik, Allgemeines Statistisches Archiv, Vol. 34, 1950, p. 291-302; Hans Kellerer. Neuere Stichprobenverfahren in der amtlichen Statistik unter besonderer Berücksichtigung amerikanischer Erfahrungen, Allgemeines Statistisches Archiv, Vol. 33, 1949, p. 83-112; Hans Kellerer. Stichprobenverfahren in der amtlichen deutschen Statistik seit 1946, Bulletin of the International Statistical Institute, Vol. 32, 1950, Pt. 2, p. 245-255; Walter Swoboda. Die repräsentativen Auszählungen zu den Volkszählungen 1946 und 1950, Zeitschrift des Bayerischen Statistischen Landesamts, Vol. 89, 1957, p. 139-151; K. Szameitat and C. Meyrich. Repräsentative Erhebungen und Aufbereitungen in der amtlichen Statistik, Wirtschaft und Statistik, Vol. 4, 1952, p. 141-145. Swoboda's paper contains a long list of the German literature on this subject. Further information on the use of sampling methods in the German official statistics can be found in Stichproben, Wiesbaden, Statistisches Bundesamt, 1960. Additional examples from the field of population censuses can be seen in United Nations. Handbook of population census methods, Vol. I, New York, 1958.
¹¹ United Kingdom. General Register Office. Census 1951, Great Britain one percent sample tables, London, H.M.S.O., 1952. All the data on the use of sampling methods in this census, unless otherwise stated, are taken from the introduction to this publication.
To understand the selection of the sample, some details will be given first on the general organization of the census. For purposes of registration of births and deaths, the country was divided into areas that were used as the local census areas. In England and Wales there were 1,275 such areas with an average population of approximately 36,000; in Scotland 1,026 and 5,000 respectively. In each of these areas, local census officers were required first to identify and list large institutions, each containing 100 or more people. These were made special enumeration districts (SEDs). In these districts, 1,357,000 persons, or 2.8 percent, were enumerated, with an average of 400 people per SED. The remainder of the households was used to establish ordinary enumeration districts (OEDs), the boundaries of which had always to lie within those of corresponding administrative units so as to make possible the preparation of statistics by these units. The size of enumeration districts was different in rural areas from that in the dense urban areas, but the principle was always followed that each enumeration district required the services of a single enumerator. In England and Wales this gave a total of 49,318 OEDs with an average of 270 households and 860 persons. In Scotland the total of OEDs was 9,730 with an average of 150 households and 510 persons.

The enumerators started work by traversing their respective enumeration districts, listing the dwelling units visited and leaving blank schedules to be filled in. Later, they collected the schedules and numbered them serially from 1 onward. Each of these schedules covered one household. In addition, the enumeration districts were also numbered serially within each local census area so that even and odd enumeration districts were usually adjacent. Within SEDs the schedules for individual persons were numbered in the same way.

On the basis of such an arrangement, the sample of households was taken in OEDs and a sample of persons in SEDs. In OEDs designated with odd serial numbers, the sample included all the households the serial number of which ended in 25 within a particular enumeration district, or, in other words, the households numbered 25, 125, 225, etc. In even-numbered OEDs, the households selected were those ending in 76, i.e., 76, 176, 276, etc. The same procedure was used in SEDs, with the difference that here it was persons who were selected. For the units drawn into the sample the enumerators were obliged to provide copies of the corresponding questionnaires. These copies were assembled within each local census area and sent to the central office.
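The selection rule by serial number is simple enough to state as a predicate. The sketch below is only an illustration of the rule as described; the function name is hypothetical, and the same function is assumed to apply to households in OEDs and to persons in SEDs.

```python
def in_one_percent_sample(district_no: int, unit_no: int) -> bool:
    """True if a household (or person, in an SED) falls in the 1 percent sample.

    Odd-numbered districts take serial numbers ending in 25,
    even-numbered districts take serial numbers ending in 76.
    """
    ending = 25 if district_no % 2 == 1 else 76
    return unit_no % 100 == ending


# Households selected in an odd-numbered district with 270 households:
print([h for h in range(1, 271) if in_one_percent_sample(1, h)])
# Households selected in the adjacent even-numbered district:
print([h for h in range(1, 271) if in_one_percent_sample(2, h)])
```

Because adjacent districts use different endings, the selected households are not tied to the same position along the enumerators' routes, which adds to the spread of the sample.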
In such a method of selecting the sample it can be seen, firstly, that the design has exploited very carefully the benefits of a high degree of stratification.¹² Here every enumeration district was used for the selection of the sample, so that it was spread over the whole territory, divided as it was into a large number of small units. In addition, a systematic selection of households or persons within each enumeration district was further undertaken in order to increase the spread of the sample. A particular reason for extending the degree of stratification so far was the plan adopted for the presentation of data according to which estimates were given first for Great Britain and then separately for England and Wales, and for Scotland. Within England, data were presented first by 9 English Standard Regions, the smallest being the southern (2.6 million persons) and the largest London and southeast England (10.9 million). Later they were presented by (i) 5 Great Britain density aggregates; (ii) 17 large areas (7 conurbations, 8 counties and cities in England and 2 in Scotland); (iii) 49 medium areas; (iv) 248 smaller areas, etc.

The program for these advance estimates included the following groups of tables:

I    : Age and marital condition (5 tables)
II   : Occupation (11 tables)
III  : Industries (4 tables)
IV   : Housing of private households (7 tables)
V    : Social and economic characteristics of private households (7 tables)
VI   : Composition of private households (8 tables)
VII  : Institutions and other nonprivate households (8 tables)
VIII : Education (8 tables)
IX   : Place of birth and nationality (4 tables)
X    : Fertility (11 tables)
XI   : Welsh and Gaelic languages (2 tables)

¹² A similar technique was used in Japan and Yugoslavia too but to a lesser extent.
Needless to say, the precision of the figures published in the above tables varies according to the proportion being estimated and the size of the population. For this reason, it was not possible to publish equally detailed information on all the areas taken into consideration in preparing the plan for tabulation. To avoid publication of figures with too large sampling errors, the procedure of contracting several cells in frequency tables was adopted for smaller areas. By this means the proportions estimated were increased within a given size of population and the sampling errors consequently decreased. For example, the table "Age and marital condition" has very different classifications for areas of different size. In the table prepared for Great Britain, the ages are broken down into individual years, with the last class being 95 years and over. Under the heading of marital condition separate figures are published for males and females by presenting first the total number of persons within these two classes for a given year and then the analysis of these totals into subtotals showing the number of persons single, married, widowed and divorced. In the same table for England and Wales, Scotland and regions of England, the number of age categories was reduced to 23 and the classes "widowed" and "divorced" were contracted. In corresponding tables for medium areas, the number of age categories was further reduced to 18. Here the age was crossed first with the total population divided by sex and then by the total number of married persons divided by sex. At the last stage are tables for smaller areas. They are further reduced: 3 marital condition and only 8 age categories are crossed with sex distribution. Specimens of these four types of "Age and marital condition" tables are reproduced in Appendix 2.

These contractions of the tables on the program are certainly more than a mere sampling problem. In the publication referred to, the table "Age and marital condition" covers three pages in its detailed form for Great Britain. If it had to be published in the same detailed manner for all the areas included in the tabulation plan, approximately 1,000 pages would be needed for this table alone, involving costs and preparation time that can easily be imagined. These ...

... and r, the actual size of the sample will be n = k. For i ≤ r we have n = k + 1. In a population of N = 21 and f = 0.25, n will be 6 for i = 1 and 5 for i > 1. In other words, if a given sampling fraction is applied to the same population, the resulting size of the sample will be different depending upon the random start. It will easily be seen, however, that on the basis of chance variations n can assume only two values, either k or k + 1.
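The dependence of the sample size on the random start can be checked by direct enumeration. The sketch below assumes that the sampling interval is the integer 1/f (here 4 for f = 0.25) and that units are numbered from 1 to N; the function name is hypothetical.

```python
def systematic_sample(N: int, interval: int, start: int) -> list:
    """Serial numbers selected systematically from units 1..N."""
    return list(range(start, N + 1, interval))


N, interval = 21, 4                      # f = 0.25, so every fourth unit
sizes = {start: len(systematic_sample(N, interval, start))
         for start in range(1, interval + 1)}
print(sizes)                             # sample size for each random start
```

Here N = 21 = 4 x 5 + 1, so k = 5 and r = 1: only the random start 1 yields six units, while the starts 2, 3 and 4 yield five.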
If in a given case fN = 1,328.4, the actual size of the sample should be either 1,328 or 1,329. Sample sizes different from these two numbers are not accounted for by chance variations. In other words, the total size of the sample is affected by the selection bias² whenever

$$ n < k \quad \text{or} \quad n > k + 1 \tag{7.8} $$
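The test in (7.8) amounts to comparing the actual sample size with the two integers that bracket fN. A minimal sketch follows, assuming a single random start over the whole population and taking k as the integer part of the expected size; the function name is hypothetical.

```python
import math


def selection_bias_suspected(actual_n: int, expected_size: float) -> bool:
    """Apply test (7.8): bias is suspected if n < k or n > k + 1."""
    k = math.floor(expected_size)        # integer part of fN
    return actual_n < k or actual_n > k + 1


print(selection_bias_suspected(1328, 1328.4))   # within chance variation
print(selection_bias_suspected(1340, 1328.4))   # outside chance variation
```

When the selection is carried out independently in many districts, each with its own random start, the admissible range for the overall sample size is wider than this test allows.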
As to the effect of the selection bias on the actual size of the sample in the case of systematic sampling there is no need for long discussion. If the enumerators do not follow the instructions given, the result will normally be that the actual size of the sample will deviate more or less from the expected size. This happens when the enumerators forget to include some units in the sample or if they include in the sample each fifth unit, starting the count again on each page instead of from the be ginning of the enumeration. The adjustment is a technique of removing discrepancies between the sample estimates and their corresponding result in the complete count, irrespective of the origin of these discrepancies. The possibilities of the adjustment depend upon how much is known about the population. Let us assume that in Table 25 only N, the total number of units in the popu lation, is known. In this case, there is no need for N'. The first improve ment that this knowledge brings will be obtained by replacing in the estimates (7.2) through (7.6) the prescribed sampling fraction, viz., f, by the actual sampling fraction, viz., f'. The latter is obtained from the re lationships f' = n/N, where n is the actual size of the sample. If f' is inserted instead off in the equations (7.2) through (7.6), the computational effort in the preparation of the corresponding estimates will considerably increase. As compensation for this effort, we remove all the biases from the corresponding estimates that are associated with the size of the sample 2 The relationships in (7.8) represent a valid indicator of the presence of selection bias on the assumption that only one random start is used and the selection is car ried over the whole population. In practical work, however, the country is divided into enumeration districts or similar area units and the selection of the sample might be carried out independently in each of them by using different random starts. 
In this case, the range of chance variations of the real size of the sample as expressed by (7.8) is valid within each of these units but is no longer valid for the whole population. Therefore, the range of chance variations is much larger than in (7 .8). For the sake of simplicity, we shall continue with the test provided by (7.8) and the assumption made therein.
and would insert themselves into the estimates if the prescribed sampling fraction were used in the estimation procedure.

The following is another improvement. If all $N_r$ are estimated from the sample, we normally have $\sum_r N'_r \neq N$. This is also true even if $f'$ is used instead of $f$. Discrepancies of this type might be a source of confusion in some tables and therefore we might wish to achieve $\sum_r N'_r = N$. To achieve this effect there is no need to compute $f'$. Instead, we can directly compute

$$p_r = \frac{n_r}{n} \tag{7.9}$$

and

$$N''_r = N p_r \tag{7.10}$$

As $\sum_r p_r = 1$, we also have $\sum_r N''_r = N$.

It was pointed out that the use of $N$ removes from the estimates listed above the bias associated with the sampling fraction or the total size of the sample. However, it does not remove the biases that are due to preferences for some special type of units in the selection of the sample. The case in point is $N'_{r2}$. The sample might be selected in such a way that the proportion $n_{r2}/n$ is significantly different from $N_{r2}/N$. In this case the estimate $N'_{r2}$ will normally be biased even if $f'$ is used instead of $f$. A great part of the danger of such biases is removed if $N_r$ becomes known. This makes it possible to compute the ratio

$$r_r = \frac{n_{r2}}{n_r} \tag{7.11}$$

and use the estimate

$$N''_{r2} = N_r \times r_r \tag{7.12}$$

Namely, the estimate $N''_{r2}$ will be free from the bias if there is no preference in the selection of the sample for either the subclass $N_{r1}$ or $N_{r2}$. If the selection has favored any of these subclasses the bias will remain.

It is thus seen that the more is known about the population, the more ample may be the program of adjustment and the further we might be able to go in the elimination of biases from the sample estimates.
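The computational adjustment described in this passage can be sketched in a few lines. The equations (7.9) through (7.11) are not legible in this copy, so the sketch assumes that the class proportion is p_r = n_r/n, that the adjusted class total is N multiplied by p_r, and that the subclass ratio is n_r2/n_r. The function names and the sample counts below are hypothetical and serve only as an illustration.

```python
def class_totals(sample_counts, N):
    """Adjusted class totals N * p_r with p_r = n_r / n (assumed form of 7.9-7.10).

    sample_counts maps each class r to n_r, the number of sample units in it.
    The adjusted totals add up to the known population size N by construction.
    """
    n = sum(sample_counts.values())
    return {r: N * n_r / n for r, n_r in sample_counts.items()}


def subclass_total(N_r, n_r2, n_r):
    """Subclass estimate N_r * (n_r2 / n_r), usable once N_r is known (7.12)."""
    return N_r * n_r2 / n_r


# Hypothetical sample of n = 100 units from a population of N = 1000.
counts = {"small": 40, "medium": 35, "large": 25}
totals = class_totals(counts, N=1000)
print(totals)                      # adjusted totals per class
print(sum(totals.values()))        # agrees with N
print(subclass_total(400, 10, 40)) # subclass estimate within a class of known size
```

The first function needs only N, while the second needs the class total N_r, which mirrors the point of the text that each additional piece of knowledge about the population permits a further step of adjustment.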
A number of methods have been worked out to deal with the problem of computational adjustment in more complex cases such as the cross-classification tables. An example of these methods is the use of the theory of least squares which, as applied to the problem of adjusting cell frequencies on the basis of the known marginal totals, means minimizing the sum of squares of deviations between the estimated cell frequencies and the adjusted values of those frequencies. The result thus achieved is the agreement between the sums of the adjusted cell frequencies and the known marginal totals. In its exact formulation this method involves a considerable amount of calculation. However, certain approximate methods and short cuts have also been developed in an attempt to reduce the bulk of the calculation.³ A large choice exists, therefore, of theoretical alternatives, depending upon the aims to be achieved, on the facilities available for calculations, and on the time remaining before data must be released.

Let us now see how the estimates (7.2) through (7.6) can be improved by changing the sample. If nothing is known about the population, these equations are used as they are with $f$ fixed in advance in the design of the sample. With $N$ known, we can check whether the actual size of the sample deviates significantly from the expected size $n_e$. In fact, $n_e = fN$. If $n \neq k$ and $n \neq k + 1$, as indicated in (7.8), it means that there is some selection bias in the sample. For example, if $n = 985$ and $fN = 972.3$, the selection of the sample was not correct. The correct sample size should be either 972 or 973. The actual sample size has a clear excess of 12 units. If this number of units is selected at random from those in the sample and discarded, the process of adjustment of the sample size is accomplished. By this procedure, $f$ is made to agree with what was initially planned, such as 0.1 or 0.01, or any convenient number.
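The check on the total size of the sample and the random discarding of surplus units can be illustrated as follows. This is a minimal sketch under stated assumptions: the admissible sizes are taken to be the integer part k of fN and k + 1, which is how the text refers to (7.8); the values f = 0.1 and N = 9723 are hypothetical and chosen only so that fN = 972.3 as in the example; the function names are not taken from the source.

```python
import math
import random
from fractions import Fraction


def size_discrepancy(n, f, N):
    """Signed number of units by which n lies outside the admissible sizes.

    The admissible sizes are k and k + 1, where k is the integer part of f*N
    (only k when f*N is a whole number). Positive means surplus units,
    negative means missing units, zero means no evidence of bias in the size.
    """
    expected = Fraction(str(f)) * N   # exact arithmetic avoids float rounding
    k = math.floor(expected)
    upper = k if expected == k else k + 1
    if n > upper:
        return n - upper
    if n < k:
        return n - k
    return 0


def discard_excess(sample_ids, excess, seed=0):
    """Keep a random subset of the sample after dropping `excess` units."""
    rng = random.Random(seed)         # seed is fixed only for reproducibility
    return rng.sample(sample_ids, len(sample_ids) - excess)


excess = size_discrepancy(n=985, f=0.1, N=9723)
trimmed = discard_excess(list(range(985)), excess)
print(excess, len(trimmed))
```

With the figures of the example the discrepancy is 12 units, and the trimmed sample has the admissible size of 973.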
The use of these numbers in the estimation procedure will either remove the biases in the estimates (7.2) through (7.6) or improve their accuracy. One or the other will occur depending upon the characteristics of the selection bias.

3 The problem of adjustment has been studied very closely in connection with censuses of population and the reader who happens to be interested in the theory concerned may wish to consult the following literature: W. Edwards Deming. Statistical adjustment of data, New York, Wiley, 1946; W. E. Deming and F. F. Stephan. On the least squares adjustment of a sample frequency table when the expected marginal totals are known, Annals of Mathematical Statistics, Vol. 11, 1940, p. 427-444; M. A. El-Badry and F. F. Stephan. On adjusting sample tabulations to census count, Journal of the American Statistical Association, Vol. 50, 1955, p. 738-762; J. H. Smith. Estimation of linear functions of cell properties, Annals of Mathematical Statistics, Vol. 18, 1947, p. 231-255; F. F. Stephan. An iterative method of adjusting sample frequency tables when expected marginal totals are known, Annals of Mathematical Statistics, Vol. 13, 1942, p. 116-178; P. Thionet. L'ajustement des resultats des sondages sur ceux des denombrements, Review of the International Statistical Institute, Vol. 27, 1959, p. 8-25.

If $N_r$ becomes available, it will make possible a separate adjustment of the sample size in each class of distribution. Here again the expected size of the sample in the r-th class will be $n_{re} = fN_r$. The quantity $n_{re}$ is then compared with $n_r$ and the same procedure is followed as above. This additional adjustment will further improve the estimates.

As the possibilities of the improvement of the estimates with this type of adjustment also depend upon the amount of knowledge about the population, it might be advantageous, if this adjustment is on the program, to postpone the tabulation of the sample as long as possible, even until the complete tabulation is over. Several classifications of holdings may then be available for improving the composition of the sample.

In practical work, a limit has to be placed on the number of such adjustments. Changing the sample on the basis of random selection by adding or removing certain units takes up time and it cannot be carried out according to all the classifications in the complete tabulation. Two or three distributions will in fact suffice if they are otherwise efficient as criteria for stratification. An example of such distributions may be the classification of holdings by size; another is the classification by "type of farming," and so on. If the classification of holdings by size is used for adjustment purposes, it will be easily detected whether the selection of the sample was carried out with a preference for any particular size of holding. If so, the sample will be adjusted accordingly. Since the size of holdings is correlated with a number of other characteristics on the census program, it means that an adjustment performed on the basis of data concerning the size classification will be efficient with regard to many other characteristics as well.

The same type of adjustment may also improve the accuracy of estimates from another point of view. In order to illustrate the point, it will be assumed that a sample survey has been taken together with the complete enumeration census for the purpose of broadening the scope of the latter. For the sake of simplicity, it will be assumed that a simple random sample of holdings was used. The task is to estimate, inter alia, the total number of nonfamily workers, employed as of the census day, by various size classes of holdings. The punch cards for n holdings included in the sample are available and they are first classified by size.
Now, if the size of the sample in the r-th size class is $n_r$, and $x_{ri}$ is the number of nonfamily workers employed in the i-th holding of the r-th size class, the average for the class will be

$$\bar{x}_r = \frac{1}{n_r} \sum_i x_{ri} \tag{7.13}$$

An unbiased estimate of the mean for the population as a whole is given by

$$\bar{x} = \frac{1}{n} \sum_r \sum_i x_{ri} \tag{7.14}$$

An alternative way of expressing (7.14) is

$$\bar{x} = \frac{1}{n} \sum_r n_r \bar{x}_r = \sum_r w_r \bar{x}_r \tag{7.15}$$

The estimate (7.14) is based on the simple random sample of size $n$. In sampling without replacement, its variance is equal to

$$\sigma^2_{\bar{x}} = \frac{N - n}{N - 1} \, \frac{\sigma^2}{n}$$

with $\sigma^2$ being the per holding variance in the population. Let us now assume that $N_r$ has become available. This information is used to change $n_r$ into $n'_r$ in such a way that

$$\frac{n'_r}{n} = \frac{N_r}{N}$$

This is equivalent to using $N_r/N$ in (7.15) instead of $w_r$. In this case, (7.15) can be considered as the estimate of the mean from a stratified sample with proportional allocation. This mean has the variance

$$\sigma^2_{\bar{x}} = \sum_r \left( \frac{N_r}{N} \right)^2 \frac{N_r - n'_r}{N_r - 1} \, \frac{\sigma_r^2}{n'_r} \tag{7.16}$$
which can be expected to be smaller than the variance of simple random sampling. Formula (7.16) is the variance of the mean in stratified sampling with proportional allocation which is, in general, lower than the variance of a corresponding simple random sample as long as at least two size class means are different. With the size classification of holdings as the basis for adjustment, one is likely to get large differences between the means of various classes and consequently a reduction in the variance as well.

It may thus be considered that the adjustment of the size of the sample and its composition by adding or removing some units provides not only a means of removing biases from the sample but also a way of reducing the mean square errors.

Choosing the type of adjustment

Before the use of one or other of the two types of adjustment is planned in practice, one should be aware of one aspect of the difference between them. Computational adjustment is theoretically fully correct and it can be applied to any extent. In fact, the use of the known results from the complete count is always recommended in order to improve the precision and remove biases.

It is different, however, with changes of the sample itself. If carried to extremes, the operation of changing the sample by taking units away and adding new ones, or by duplicating the existing ones, means an abandonment of the principle of random selection and the adoption of a somewhat arbitrary structure of the sample achieved by successive additions. Such an operation destroys, in principle, the random character of sampling distributions and makes the theory inapplicable.

The adjusted estimates are accompanied by corresponding sampling errors which are intended to indicate the range of sampling variation of these estimates. Sampling errors apply only to cases of strict random selection, i.e., to estimates derived from samples subject to random distribution. Sampling errors computed from a sample obtained after a number of successive adjustments and changes might, strictly speaking, be a poor indicator of the variations associated with the respective estimates. In essence, therefore, this type of adjustment is not in full accord with the theory.

This does not mean, however, that it should not be applied in practical work. If post hoc changes of the structure are not justified, the same objection should be raised with respect to the deviations from the theory
in the selection itself of the sample. In the case of systematic samples, for instance, the prescribed procedure says that every k-th holding should be selected or, in other words, that a sampling fraction f should be applied. If an examination of the total size of the sample and its size by various classes reveals that the prescribed procedure has been violated, the adjustment of the sample may be interpreted as a violation in the reverse sense to that committed during the selection procedure.

It is very difficult to determine the value of this type of adjustment. The fact that it helps to remove biases may be of inestimable value in some situations. On the other hand, it opens the door to a danger of unknown dimensions. Where is the limit to arbitrariness in changing the sample? If carried sufficiently far, changing the sample may be interpreted as going back to a purposive selection and all the complications involved in such a procedure. If applied with certain restrictions, the question still remains as to whether and to what extent the available theory can be used in interpreting the results of such samples. In addition, removing the units selected in adjusting strata samples means jettisoning a part of the information available. If a duplication of the existing units is used instead of adding new ones, it leads to an increase in the variance.⁴ It is, therefore, not easy to say what one can expect from such an adjustment in a concrete case. It is believed, however, that advantages achieved in some special cases, such as biased selection, provided that the amount of changes is kept to a minimum, considerably outweigh the losses.

In deciding on the aims of the action to be taken in adjusting the sample, the procedure used in its selection will be of basic importance. In point of fact, the selection will, in many cases, determine at least an outline of the action to be taken.
To be precise, if a special procedure⁵ is applied to select the sample, the bias will, in principle, be catered for automatically and there will be no need to think of the adjustment as a means of removing the effects of the selection bias. On the other hand, if in carrying out the procedure of automatic selection a large number of persons are involved, selection biases are likely to be present, and the problem they constitute becomes one of primary importance.

4 Cf. Morris H. Hansen, William N. Hurwitz and William G. Madow. Sample survey methods and theory, New York, Wiley, 1953, Vol. 2, p. 140-141.
5 In the sense of the terminology used earlier.

The selection of the sample is thus one problem to be considered in planning adjustment action. The other is connected with the question
of the amount of adjustment that should be attempted in a given case. The answer to this question stands in close relation to the program for the use of sampling methods in a census. If sampling methods are used for broadening the scope of both the census and the tabulation programs, this involves work which can easily wait until the complete tabulation is over. In that case, a number of results from the complete tabulation will be known that can be used for purposes of either computational adjustments or changes in the sample. If these two uses of sampling methods are adopted, it might be important to postpone the tabulation of the sample as long as possible so that the availability of the results of the complete count can be utilized to the fullest extent. This remark refers particularly to cases where automatic selection of the sample is applied for purposes of broadening the scope of the census. Changes of the sample may prove essential here.

If advance estimates have to be prepared, however, tabulation of the sample has to start as early as possible. It might then easily happen that at the moment of its execution nothing more is available as the result of the complete count than the total number of holdings by various administrative units. It again means that a limited amount of adjustment connected with the total number of holdings can be taken into account. However, if the advance estimates are accompanied by the two other uses of sampling, the procedure of adjusting might be usefully split up into two parts: a limited amount of adjustment with advance estimates, and a more complete process of adjustment in connection with the two other uses.

Particular emphasis should be given to the need for a wide adjustment program in the latter cases, because data collected by means of a supplementary sample survey are tabulated on a sample basis and, therefore, every possible expedient has to be exploited to remove biases from the sample and increase the precision of the estimates.

Decisions regarding the adjustment also depend upon the program of tabulation. Even the most complicated computational procedures can easily be used to adjust a small number of tables. For example, in the preparation of the advance estimates of the 1949 Census of Agriculture in the Federal Republic of Germany, the total number of holdings was known for each size class. When 2 percent of the class frequencies available from the complete count were compared with the size of the systematically selected sample of 2 percent of questionnaires, an important discrepancy was found. To obviate the difficulties which would have arisen if the sampling fraction of 0.02 had been applied for each size
class irrespective of the actual size of the sample, the exact raising factors were computed on the basis of the known class totals.⁶ Such a procedure leads, in principle, to a considerable amount of computation which was possible in this case because of the very small number of tables.

6 Rudolf Giehl. Die repräsentative Vorwegaufbereitung der landwirtschaftlichen Betriebszählung 1949, Zeitschrift des Bayerischen Statistischen Landesamts, Vol. 89, 1957, p. 152-163.

It may be equally important to take into account the sampling fraction used and the size of the population or the size of the administrative units by which the estimates have to be prepared. If a systematic sample with the sampling fraction $f$ is selected, the magnitude of the discrepancy based on chance variations will vary depending on the random start. Its maximum value will be $\left(\frac{1}{f} - 1\right)$ units. With a sampling fraction of $f = 0.2$ and $N = 1,000$ units, it makes 4 units, or 0.4 percent of the total. With $N = 1,000,000$, the maximum possible disagreement would be so small as to be difficult even to express for practical purposes. In other words, with both large sampling fractions and populations and a selection procedure free from biases, practically nothing could be gained by using $N$ or $N_r$ for adjusting the results.

The adjustment is of primary importance in the case of automatically selected samples, which will usually be subject to selection biases. However, before any action can be planned for adjusting such samples, one has to determine whether the sample is biased and, if so, to what extent.

The first check on the presence of biases might be performed on the total size of the sample. However, such a check opens up rather limited possibilities because of compensating effects within the sample itself. Some parts of the population may be overrepresented and some underrepresented, with the final result that the total size of the sample is equal to its expected value resulting from the sampling fraction adopted. It becomes necessary, therefore, to check for the presence of biases in various parts of the sample as well as in the sample as a whole. In doing so, one has to break down the total size of the sample into as many parts as feasible, according to the administrative division of the country and the availability of the known total number of units from the census. If the parts used are too large, the compensating effects within each of them may remain undetected. The chances of detecting biases increase if the size of the parts is reduced, and this will usually present no difficulty if the administrative division of the country is used for the purpose.
Small units such as communes and groups thereof, or larger units like districts and/or counties, can be conveniently used for this work. The totals of units as obtained by the census will be known by such areas. From these totals, the expected size of the sample is obtained and a comparison of the actual size of the sample with the expectation shows the amount of adjustment needed.

Such an adjustment should, if possible, be carried out in several stages. The result achieved after checking parts of the sample secures an adequate number of units from each area, but it does not prevent a preference for including in the sample some special type of units, such as large agricultural holdings or small ones, as the case may be. Further adjustment in this respect is feasible if some distributions of holdings become known from the complete tabulation. The number of holdings selected in the sample from a particular class is again compared with the expected size of the sample, and according to the result obtained the sample is further adjusted.

ILLUSTRATIONS

The principles presented in the last paragraph of the previous section were applied in adjusting the sample used in the 1950 United States Census of Agriculture. This was a 20 percent sample of nonlarge farms, while all the large farms were included in the sample. The selection procedure was similar to, but not identical with, a systematic selection. In order to check the presence of selection biases, the questionnaires obtained in each county were separated into three groups: (i) large farms; (ii) farms in the sample; and (iii) the nonlarge farms which are not in the sample. After that, the percentage in each county of nonlarge farms in the sample was computed and compared with the expected size of the sample, i.e., 20 percent. If significant disagreements were found, the sample was adjusted either by eliminating a certain number of questionnaires selected at random from the group of those in the sample, or by adding questionnaires selected at random from the nonlarge farms not included in the sample.

Checking on the presence of biases in the selection of the sample in a larger area consisting of several enumeration districts would follow the same line as in the above simplified cases if the selection were carried over district boundaries without interrupting the constant sampling
interval. If the selection in each enumeration district were independent, the expected size of the sample in a bias-free selection would be equal to $fN$, although the range of chance variations is larger under this assumption. A corresponding probability distribution can be constructed which makes it possible to determine the points within which the total size of the sample is supposed to fall with a given probability.

In the illustrations given earlier of the automatic selection of the sample in the United States Censuses of Population and Agriculture, such checks on the presence of selection biases cannot be used because the selection procedure is not a systematic one. If form A2 (reproduced as Appendix I), which was used in selecting the sample in the 1954 United States Census of Agriculture, is examined, it will be seen that it does not give a systematic sample. If the procedure of listing households were applied as prescribed, the form would secure on average a sample of 20 percent but it would not secure the selection of every fifth holding. Such samples are therefore considered as simple random samples. If $P$ denotes the expected proportion of holdings to be enumerated on shaded lines, the standard deviation of the number of units in the sample will be $\sqrt{NPQ}$. In a population of $N = 1,600$ and the sampling fraction $f = 0.2$, it gives the expected size of the sample $NP = 320$ with a standard deviation of 16 units. In other words, the chances are that 95 times out of 100 the number of holdings enumerated on shaded lines will be in the range of $320 \pm 2 \times 16$, i.e., between 288 and 352. Alternatively, it can also be interpreted as follows: there are 5 chances in every 100 that the size of the sample free from selection bias falls outside the range of $20 \pm 2$ percent. On the basis of such a calculation, a table can easily be constructed for purposes of testing the presence of biases in the size of the sample. With a fixed sampling fraction the quantity $\sqrt{PQ}$ is constant.
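The acceptance limits described here can be tabulated with a short script. This is only a sketch: the limits are taken as NP plus or minus two standard deviations, as in the text; the function name and the population sizes other than 1,600 are hypothetical.

```python
import math


def acceptance_limits(N, P, z=2.0):
    """Range NP +/- z * sqrt(NPQ) for the size of a bias-free sample."""
    Q = 1.0 - P
    sd = math.sqrt(N * P * Q)
    return N * P - z * sd, N * P + z * sd


# Example from the text: N = 1,600 and P = 0.2 give limits of 288 and 352.
low, high = acceptance_limits(1600, 0.2)
print(round(low), round(high))

# A small table of acceptance limits for several (hypothetical) population sizes.
for N in (400, 1600, 6400):
    lo, hi = acceptance_limits(N, 0.2)
    print(f"N = {N:5d}   expected = {N * 0.2:7.1f}   limits = {lo:7.1f} to {hi:7.1f}")
```

Because the standard deviation grows only with the square root of N, the limits become narrower in relative terms as the area grows, which is why the table is indexed by N.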
Several values of N can then be used to prepare a table or a chart showing the acceptance limits of the size of the sample. The samples with sizes falling outside the range established are adjusted according to principles explained. The results of a similar check conducted in connection with the 1950 United States Census of Population are presented in Chapter 5. It will be seen that sample sizes for some counties deviate from the expected size by more than four standard deviations. In changing the sample belonging to a large area, it may be important to work out details as to where to take units to be added to the original sample or what units should be eliminated. Without such a provision,
changing the sample might introduce further deterioration in it. To illustrate the point, let us assume that the area in question consists of several districts with some of them overrepresented in the sample and some underrepresented. If the adjustment were performed, in this situation, by adding some more units from overrepresented districts and eliminating some others from the underrepresented districts, the adjusted sample would be worse than the original one from the point of view of correct representation of individual districts.

The procedure used in adjusting the sample of the 1954 United States Census of Agriculture will be taken as an illustration of the way of making this provision.⁷ It will be remembered that this sample included all the farms of 1,000 acres and more, and 20 percent of farms of a size smaller than 1,000 acres. From the complete count, data were available on the total number of farms in each size class. These data were available by counties and the adjustment was performed by state economic areas which consisted on average of several counties each. For adjustment purposes, the size of the population in each size class was divided by 5, to arrive at the expected size of the sample. The difference was then found between the expected and the real size of the sample, and this shows the direction and the amount of adjustment to be made in each size class.

The results of such a procedure for an economic area consisting of five counties are presented here in Table 26. Column 4 of this table shows the differences between the expected and the actual size of the sample. The action to be taken on the basis of these differences is described in the last column. In the first line the difference is -8.2 holdings, which means duplicating 8 questionnaires selected at random from the available 90 in this class. In the third line the difference is 6.6 holdings, which means removing from the available sample 7 questionnaires selected at random.
With regard to the question of where to take the 8 questionnaires that have to be duplicated in the class of farms under 10 acres, the total of 491 holdings in this class was broken down by counties and the data in Table 27 were obtained. Column 1 of this table shows the number of farms under 10 acres in each county. The remaining columns contain information similar to that in Table 26. It may also be seen from this table that the first three counties are underrepresented in the sample of farms under 10 acres. Therefore, 8 duplications are assigned to these three counties.

7 All data presented here are taken from United States Bureau of the Census. U.S. Census of Agriculture, 1954, Vol. 3, Special reports, Part 12, Methods and procedures, Washington, D.C., U.S. Government Printing Office, 1956, p. 67-68.

TABLE 26. - DATA SHOWING THE DIRECTION AND THE AMOUNT OF ADJUSTMENT IN AN ECONOMIC AREA (1954 United States Census of Agriculture)

Size group       Total       Expected number    Actual number      Difference       Adjustments to be made
(total acres     number      in sample (total   in sample (as      between
in place)        of farms*   number divided     designated by      expected and
                             by 5)              enumerator)        actual numbers

Under 10           491          98.2               90                -  8.2         Duplicate information on 8 questionnaires
10-29              596         119.2               99                - 20.2         Duplicate information on 20 questionnaires
30-49              492          98.4              105                +  6.6         Eliminate information on 7 questionnaires
50-69              734         146.8              142                -  4.8         Duplicate information on 5 questionnaires
70-99              988         197.6              200                +  2.4         Eliminate information on 2 questionnaires
100-139          1 379         275.8              258                - 17.8         Duplicate information on 18 questionnaires
140-179          1 007         201.4              192                -  9.4         Duplicate information on 9 questionnaires
180-259          1 247         249.4              267                + 17.6         Eliminate information on 18 questionnaires
260-499          1 103         220.6              230                +  9.4         Eliminate information on 9 questionnaires
500-999            199          39.8               43                +  3.2         Eliminate information on 3 questionnaires

* Excludes specified farms.

TABLE 27. - COMPARISON OF EXPECTED AND REAL SIZE OF THE SAMPLE BY COUNTIES 1

County       Total number    Expected number     Actual number      Difference
             of farms*       of sample farms     of sample farms

Chemung           78             15.6                10               - 5.6
Schuyler          69             13.8                13               - 0.8
Steuben          113             22.6                20               - 2.6
Tioga            121             24.2                25               + 0.8
Tompkins         110             22.0                22                 0.0

Total            491             98.2                90               - 8.2

1 This table is reproduced from: United States Bureau of the Census. U.S. Census of Agriculture, 1954, Vol. 3, Special reports, Part 12, Methods and procedures, Washington, D.C., U.S. Government Printing Office, 1956, p. 67-68.
* Excludes specified farms.
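The arithmetic behind the adjustment column of Table 26 can be sketched as follows. The sketch assumes, as the text states, that the expected sample size is the class total divided by 5 and that the number of questionnaires to duplicate or eliminate is the difference rounded to the nearest whole number. The function name is hypothetical, and only the first three size classes are used.

```python
def adjustment_plan(total_farms, actual_in_sample, divisor=5):
    """Return (action, count) for one size class.

    The difference is actual minus expected: a negative value calls for
    duplicating questionnaires, a positive value for eliminating them.
    """
    expected = total_farms / divisor
    difference = actual_in_sample - expected
    count = round(abs(difference))
    if count == 0:
        return ("none", 0)
    return ("duplicate", count) if difference < 0 else ("eliminate", count)


# First three size classes: (label, total number of farms, actual number in sample).
classes = [("Under 10", 491, 90), ("10-29", 596, 99), ("30-49", 492, 105)]
for label, total, actual in classes:
    action, count = adjustment_plan(total, actual)
    print(f"{label:9s} expected {total / 5:6.1f}  actual {actual:4d}  {action} {count}")
```

For these three classes the sketch yields 8 duplications, 20 duplications and 7 eliminations, in agreement with the corresponding entries of the last column of the table.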
In determining the number of adjustments to be assigned to each county and the order of these adjustments as well, the criterion adopted was the ratio of the difference between the expected and the actual size of the sample to the standard deviation. The standard deviation, $\sigma$, of the expected size of the sample in county No. I is equal to $\sqrt{NPQ} = \sqrt{78 \times 0.2 \times 0.8} = 3.53$. The difference between the expected and real size of the sample for the same county is 5.6. The sign of the difference is disregarded because it is immaterial for this ratio. Accordingly, the value of the ratio for the first county is 5.6 : 3.53 = 1.59. The corresponding ratios for counties Nos. II and III are 0.24 and 0.61 respectively. It will be seen that the largest value of the ratio is for county No. I. This means that the adjustment has to start there. After 1 of the 10 units in the sample from that county has been selected at random and the punch card for it duplicated, the real size of the sample for this particular county increases to 11 and the difference drops to -4.6. Accordingly, the new value of the ratio becomes 1.30, which is still the highest value among the three. This means that adjustment has to be continued in the same county by duplicating data for another unit. After the second adjustment is made, the difference drops to -3.6, which makes the value of the ratio 1.02. In other words, the third adjustment also belongs to the same county. If the process is continued, it will be found that the first four adjustments have to be made in county No. I, the fifth in county No. III, the sixth again in No. I, the seventh in No. III, and the eighth in county No. II. Thus the total number of adjustments performed by duplicating cards in the class of farms under 10 acres is 8, which is indicated in Table 26.

Obviously, such a procedure increases the amount of work needed to perform the adjustment.
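The step-by-step assignment of duplications to counties can be reproduced with a short loop. This is a sketch under stated assumptions: counties Nos. I, II and III are taken to be the first three counties of Table 27 (Chemung, Schuyler, Steuben); the ratio is computed with its sign, so that counties that are not underrepresented are never chosen while an underrepresented one remains; the function name is hypothetical.

```python
import math


def assign_duplications(total_farms, sample_sizes, n_adjustments, P=0.2):
    """Assign duplications one at a time to the county with the largest ratio
    (expected - actual) / sqrt(N * P * Q), recomputing after every step."""
    actual = dict(sample_sizes)            # work on a copy
    order = []
    for _ in range(n_adjustments):
        def ratio(county):
            N = total_farms[county]
            sd = math.sqrt(N * P * (1.0 - P))
            return (N * P - actual[county]) / sd
        county = max(actual, key=ratio)
        actual[county] += 1                # duplicate one card in that county
        order.append(county)
    return order


# Farms under 10 acres and sample sizes by county, from Table 27.
farms = {"Chemung": 78, "Schuyler": 69, "Steuben": 113, "Tioga": 121, "Tompkins": 110}
sample = {"Chemung": 10, "Schuyler": 13, "Steuben": 20, "Tioga": 25, "Tompkins": 22}
print(assign_duplications(farms, sample, n_adjustments=8))
```

The resulting order is four duplications in Chemung, then Steuben, Chemung, Steuben and finally Schuyler, which is the sequence given in the text for counties Nos. I, III and II.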
However, its advantage lies in the fact that it secures the effects of a stratified sample in addition to the removal of biases due to incorrect representation in the sample of either size classes of farms or counties constituting the economic areas. More work in this case simply means better chances of obtaining accurate data. If less protection against biases can be accepted with no effects on the stratification desired, the work of adjustment can be made much simpler. Another interesting point is the total amount of adjustment made in this census. Corresponding data are presented in Table 28. Column 4 shows that the total number of changes was 67,119. This table also shows the need for an adjustment by size of farms, because in the original sample
the large farms were overrepresented and the small ont!s underrepresented. The last column of the table shows that the net effect of the adjustment for the large farms consists in the elimination of a part of the sample and for small farms in adding new units to the original sample. TABLE 28. - SUMMARY OF SAMPLE ADJUSTMENT BY SIZE OF HOLDING FOR THE UNITED STATES ( 1954 Census of Agriculture) 1 Adjustments in number of farms Size of farm
Under 10 acres 10-29 acres 30-49 acres 50-69 acres 70-99 acres 100-139 acres 140-179 acres 180-259 acres 260-499 acres 500-999 acres 1,000 acres or more Total
Number of farms
Total adjustment
Net Farms adjustFarms Farms duplicated ment plus duplicated eliminated farms (number eliminated of farms)
484 291 713 335 499 496 346 323 517 740 491 158 461 651 463 698 482 246 191 697 130 481
7 676 7 468 5 048 3 204 3 661 3 076 2 562 1 974 l 886 626
977 1 903 1 886 1 768 2 919 3 205 3 253 4 220 5 109 4698
8 653 9 371 6 934 4 972 6 580 6 281 5 815 6 194 6 995 5 324
4 782 416
37 181
29 938
67 119
+ 6 699 + 5 565 + 3 162 + 1 436 + 742
-
129
- 691
-2 246 -3 223 -4072
+ 7 243
1 This table is reproduced from: United States Bureau of the Census. U.S. Census of Agriculture, 1954. Vol. 3, Special reports, Part. 12, Methods and procedures, Washington,
D.C., U.S. Government Printing Office, 1956, p. 67-68.
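The ratio criterion used earlier to decide where the next adjustment belongs can be sketched in a few lines. This is only an illustration: the function name `adjustment_ratio` is hypothetical, and the figures (N = 78 farms, sampling rate P = 0.2, 10 farms actually in the sample) are those quoted in the text for county No. I in the class of farms under 10 acres.

```python
import math

def adjustment_ratio(n_farms, rate, actual):
    """Ratio of |expected - actual| sample size to sqrt(N * P * Q)."""
    expected = n_farms * rate
    sd = math.sqrt(n_farms * rate * (1 - rate))
    return abs(expected - actual) / sd

# County No. I: ratio before any duplication, then after one and two
# punch cards have been duplicated (actual sample size 10, 11, 12).
ratios = [round(adjustment_ratio(78, 0.2, 10 + d), 2) for d in range(3)]
print(ratios)
```

At each step the county with the largest ratio receives the next adjustment, so the same function would simply be re-evaluated for every county after each duplication.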
Problems of a somewhat different nature may be encountered in dealing with samples of clusters, such as enumeration districts, households, area segments containing a varying number of agricultural holdings, etc. The variation in the size of clusters, as expressed in terms of the number of elementary units, will in general make the discrepancies between the estimates and the quantity being estimated larger than those that can be expected from a sample containing a fixed number of elementary units. For the sake of illustration, let us suppose that an area has 3,547 agricultural holdings with 14,188 persons, i.e., an average of 4 persons per holding. If a systematic sample of 1 percent is selected from this population with a random start of 61, the sample will contain 35 holdings and will yield an estimate of 3,500 holdings. If the average size of the holdings in the sample is 4 persons, the estimate of the total number of persons living on holdings will be 14,000. In other words, there is a deficit of 188 persons. An additional source of chance variation comes into effect if the number of persons has to be estimated. Suppose that a large holding has been selected in the sample so that 34 holdings give an average of 4 persons per holding but the 35th holding has 24 persons. In this case a sample total of 34 x 4 + 24 = 160 persons is obtained. Multiplied by 100, this gives an estimate of 16,000 persons in the area, or an excess of 1,812 persons, which is far from being negligible.
One way of controlling the variations in the size of clusters would be to use some data for the same clusters from a previous survey and apply the ratio method of estimation. Such a procedure will normally not be suitable in the use of sampling methods in censuses. Firstly, it involves an amount of computational work which is by no means negligible. Quite a large number of items on the census program have to be taken into account. In addition, it is difficult to expect correlated information to be available for any size of clusters that might be used in the new census. The last census may well have been taken some ten years previously and the segmentation of the country for the new census may not have retained the old cluster borders.
An alternative solution would be to equalize clusters as far as possible. It has been shown that the sample used for the preparation of the advance estimates in the 1946 Census of Population in the Federal Republic of Germany consisted of artificially formed clusters of 100 persons each. It has also been shown that in the 1950 Census of Population in Japan a sample of enumeration districts was extracted. The enumeration districts were prepared in such a way that they had an average of 50 households with a range of variation between 30 and 70. Larger districts were divided into several parts to comply with the size characteristics of the others.
A similar precaution against the effect of the variations in the size of clusters was also taken during the preparation of the advance estimates in the 1954 Census of Population in France. 8 The sample selected consisted of 5 percent of dwelling units in urban areas and houses in rural areas. However, before the selection started all the large and institutional households were removed and treated separately. The sample was thus selected from the remainder, consisting of less variable households.
8 Institut national de la statistique et des études économiques. Résultat du sondage au 1/20ème, population, ménages-logements, Paris, 1955.
If clusters of variable size are used as sampling units, it may sometimes be useful to apply the above type of adjustment and change the structure of the sample so that some estimates based on it (such as the total number of persons) agree with the known census totals. By doing so, it is also believed that the estimates of the characteristics correlated with the number of persons will be improved as well. A need for action of this type may arise if a systematic sample of clusters is selected with a small sampling fraction and the estimates have to be prepared by relatively small administrative units.
Such was, in broad terms, the situation in which it was decided to adjust the sample of households used in the 1951 Census of Population in Great Britain. It will be remembered from Chapter 6 that this sample consisted of 1 percent of households, and that every household with a serial number ending in 25 was selected in odd enumeration districts and those having serial numbers ending in 76 in even enumeration districts. The sample was selected with a view to preparing the advance estimates and broadening the scope of the tabulation program. A number of tables had to be prepared by small administrative units which in some cases did not have more than 20,000 persons. With an average of a little more than 3 persons per household, this means a sample of approximately 60 households in such regions. This is the size of sample at which the variations in the size of households might have a considerable effect on discrepancies between the sample estimates and the census totals, which become known later, after the results of the complete tabulation are available. To improve the sample in this respect it was decided to adjust it in such a way that sample estimates of the total number of both persons and households in such areas agree with census totals known from the preliminary count.
The adjustment was effected by adding a number of households to the initial sample if the estimates were incomplete or by taking some away if the estimates were in excess. However, complications appeared because in some areas the estimated total number of persons was in agreement with the census but the total number of households differed, and vice versa. For this reason, in some areas both removals and additions were necessary to improve the composition of the sample. It was not possible to supply a mathematical solution to the problem of how to change the initial sample. The adjustments performed represented, to a large extent, an arbitrary operation which had to follow some general principles to reduce the influence of personal biases that might be introduced into this work quite unconsciously. These principles were: (i) the number of households involved in any adjustment should be as small as practicable; and (ii) the size of the households involved with adjustment should not be concentrated in a single size category but distributed according to the real frequencies. An illustration of this procedure is given in Table 29.

TABLE 29. - ILLUSTRATION OF PROCEDURE USED IN ADJUSTING SAMPLE ESTIMATES TO KNOWN CENSUS TOTALS (1951 Census of Population in Great Britain) 1

                   Original sample:
                   excess or defect
Area               Persons   Households   Adjustment prescription
East Ham            + 22       + 6        Remove 6 households of sizes 2, 3, 4, 4, 4, 5
Wallasey            - 20       - 4        Add 4 households of sizes 4, 5, 5, 6
Orkney County       + 34       + 7        Remove 7 households of sizes 4, 4, 4, 5, 5, 6, 6
Chigwell            + 17       - 1        Add 6 households of sizes 1, 2, 2, 3, 3, 4
                                          Remove 5 households of sizes 6, 6, 6, 7, 7
Huntington          - 33       - 2        Add 7 households of sizes 5, 5, 5, 6, 6, 7, 7
                                          Remove 4 households of sizes 1, 2, 2, 3
Ayr Burgh           - 27       + 1        Add 6 households of sizes 5, 6, 6, 7, 7, 7
                                          Remove 6 households of sizes 1, 1, 2, 2, 2, 3

1 United Kingdom. General Register Office. Census 1951, Great Britain one percent sample tables, London, H.M.S.O., Pt. 1, 1952.
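The bookkeeping behind Table 29 can be checked with a short sketch. The dictionary below transcribes the table as it is printed here (the household figures, in particular, depend on that transcription), and the helper name `residual` is hypothetical. It shows that each prescription removes the deviation in persons entirely, while a deviation of one household may remain.

```python
# area -> (excess of persons, excess of households, sizes added, sizes removed)
TABLE_29 = {
    "East Ham": (22, 6, [], [2, 3, 4, 4, 4, 5]),
    "Wallasey": (-20, -4, [4, 5, 5, 6], []),
    "Orkney County": (34, 7, [], [4, 4, 4, 5, 5, 6, 6]),
    "Chigwell": (17, -1, [1, 2, 2, 3, 3, 4], [6, 6, 6, 7, 7]),
    "Huntington": (-33, -2, [5, 5, 5, 6, 6, 7, 7], [1, 2, 2, 3]),
    "Ayr Burgh": (-27, 1, [5, 6, 6, 7, 7, 7], [1, 1, 2, 2, 2, 3]),
}

def residual(persons, households, added, removed):
    """Excess (+) or defect (-) left once the prescription is applied."""
    net_persons = sum(added) - sum(removed)
    net_households = len(added) - len(removed)
    return persons + net_persons, households + net_households

residuals = {area: residual(*row) for area, row in TABLE_29.items()}
for area, (persons_left, households_left) in residuals.items():
    print(f"{area}: persons {persons_left:+d}, households {households_left:+d}")
```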
In using this procedure it was not, however, always possible to get the complete agreement of the sample and the census with respect to the number of both the persons and the households. The adjustments, therefore, "were designed to remove the population deviation entirely and to improve the ratio of the population to households in all areas; they also aimed at eliminating or reducing the household deviations where possible, and ensuring that in no area was the remaining deviation after adjustment more than one percent." When the process of adjustment was finished it was found that 1,266 new households were added to the sample and 1,255 removed. In other words, less than 1 percent of the total size of the sample was withdrawn.
In most areas this procedure gave the necessary adjustments without difficulty. In some cases, however, a more complete examination of individual records was needed. For example, in Chelsea Metropolitan Borough, where 603 persons and 187 households were found in the sample, it happened that the number of persons was in excess by 94 and the number of households by 1 only. An exhaustive examination of the case revealed that the sample was correctly selected but included a hotel with 100 persons. The hotel was removed from the sample and the defect of 6 persons was corrected by adding a household of 6 members.
The report states that this procedure, according to tests made, improved the sample considerably. However, it could also be said that it introduced a certain bias. As in the case just mentioned, the large households, and particularly institutional ones, were often removed to reach the balance. This might have resulted in an underrepresentation of such households in the sample. An additional bias might have been introduced by the personal judgment involved in the choice of households to be removed from or added to the initial sample.
This review of the adjustment procedures shows that various adjustment techniques are possible depending upon the sampling units used, the selection procedure, the purpose of the use of sampling methods in the census, the amount of information available from the complete tabulation, and the reasons for adopting the adjustment. In addition to this, there are two other points that have to be taken into account, viz., the time needed to carry out the adjustment, and the cost of the whole operation. An adjustment of the type used in the 1954 United States Census of Agriculture is not an easy task from the point of view of either time or cost. A sample selected through a special procedure may be more convenient in this respect for many purposes connected with census work because the selection bias is taken care of automatically and the sample selected can be used immediately after the enumeration, without waiting for the results of the complete tabulation, which otherwise are essential for adjustment. However, the selection of such samples is costly if efficient samples are desired.
Less costly samples are normally associated with insufficient precision. Consequently, the choice of an adequate adjustment procedure is bound up with a careful study of a wide range of different facts.
8. SAMPLING ERRORS
Computation of sampling errors
If sampling methods are used in order to broaden the scope of either the census or the tabulation program, this may easily result in the estimates running into thousands. The problem then arises as to how to compute the sampling errors for such a large number of estimates. In considering this problem and in deciding about the procedure to be adopted, the following points should be borne in mind.
1. Number of estimates, i.e., the number of errors to be computed. If it is a question of only a small number, the task can be performed adequately on a simple desk calculator without entailing undue difficulties as to time and cost.
2. Type of design being employed. The computation of errors in a simple random sampling design means relatively little work as compared with more complex designs.
3. Purpose of errors. If the sampling errors are considered merely as somewhat rough indicators of the sampling variation associated with the corresponding estimates, the time devoted to their computation will naturally be less than in cases where the components of the total variation have to be shown for various purposes of analysis.
4. Type of estimate. Errors of the estimated proportions in simple random sampling are obtainable immediately. The computation, on the other hand, of errors of the estimated ratios of two continuous variables in, say, a multistage design entails a considerable amount of work.
5. Existence of appropriate facilities. Today electronic computers are being used on an ever-increasing scale for this purpose, and an enormous amount of time can thus be saved in large-scale operations, even as compared with the standards set by automatic desk calculators and punch-card machinery. In a case reported by Daly and Hansen, 1 where a special sample survey of farm expenditures was being conducted, it was necessary to compute approximately 1,200 variances and covariances. The necessary data were punched on 4,000 cards, each of which had to be weighted in a certain way in the course of the estimation process. The programing of machine instructions took about two man-days; 2 hours of computer time were needed to test the plan, while 30 minutes only were necessary for the computer to perform the task imposed. In this way the problem of computation of sampling errors will assume a somewhat different significance in the future, when more technical progress has been achieved and use of electronic computers becomes a routine procedure.
Currently, however, automatic or semiautomatic calculating machines are the standard facilities in most countries for the computation of sampling errors, together with modern punch-card equipment. With such facilities a computation of an even moderately large number of errors, such as several thousand, presents a serious problem. The procedure sometimes followed in such a situation is to publish estimates and simply neglect errors. A modification of the same basic procedure will be found in cases where errors for some principal items in the most important tables are published, say, those referring to the country as a whole, while the errors for less important items and those relating to subdivisions are disregarded.
Such a neglect of sampling errors may be difficult to justify. This attitude leaves users of data in a quandary as to the usefulness of figures published. To say that each sample estimate is associated with an error and to keep the magnitude of errors unknown would destroy the users' confidence in the estimates because there is nothing to place limits on their mistrust.
On the other hand, the absence of sampling errors would equally give rise to unjustified reliance on figures which, because published, represent data that, it is assumed, can be used for any purpose. One of the advantages of sampling methods is precisely that they help to show up the magnitude of errors associated with various estimates. Neglecting errors partly or entirely means a corresponding neglect of this advantage. Every effort should therefore be made to provide each estimate with its corresponding error.
1 Joseph F. Daly and Morris H. Hansen. Data processing on electronic computers in the United States Bureau of the Census, Bulletin of the International Statistical Institute, Vol. 36, 1958, Pt. 4, p. 376-387. See also B.M. Church and S. Lipton. The use of an electronic computer in the estimation of sampling errors in a nutritional survey, The British Journal of Nutrition, Vol. 10, 1956, p. 27-32; S. Lipton. Some statistical applications of electronic computers, Applied Statistics, Vol. 6, 1957, p. 102-113.
In what follows an attempt will be made to review the basic techniques in the computation of sampling errors. This review might be found useful while considering how to tackle this problem in a concrete case. The emphasis will be placed on techniques which involve various shortcuts. A number of procedures have been worked out to reduce the amount of computation involved in arriving at the estimates of sampling errors. However, only some of them will be reviewed here, viz., those which are general enough and can be expected to give a sufficiently broad guidance.
The first technique that might be profitably used in appropriate cases consists in selecting a subsample from the sample actually used and estimating the variance from this subsample. If the original sample has n units, and a subsample of n' units is selected at random, the estimated variance from the subsample will be

$$s^2 = \frac{\sum_{i=1}^{n'} (x_i - \bar{x}')^2}{n' - 1} \tag{8.1}$$

where $\bar{x}'$ is the arithmetic mean computed from the subsample of n' units. The variance resulting from this equation is then used in estimating the precision of the estimates based on the original sample of n units. Needless to say, in computing the magnitude of the standard error the original size of the sample needs to be taken into account.
As regards the size of the subsample, approximately 50 units will be sufficient to get an acceptable precision of the estimated variance, provided the distribution of the characteristics involved is a normal one. Special care is needed in cases of highly skewed distributions and populations containing extremely large or small values. The size of the subsample required in order to obtain the same precision of the estimated variance in such cases will be much larger. 2
2 More details on various problems in estimating variances from non-normal distributions are given in William G. Cochran. Sampling techniques, New York, Wiley, 1953, Chapter 2; W. Edwards Deming. Some theory of sampling, New York, Wiley, 1950, Chapter 10; Morris H. Hansen, William N. Hurwitz and William G. Madow. Sample survey methods and theory, New York, Wiley, 1953, Vol. 1 and 2, Chapter 10; Pandurang V. Sukhatme. Sampling theory of surveys with applications, Ames, Iowa State College Press, and New Delhi, Indian Society of Agricultural Statistics, 1954, Chapter 2 A.
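The subsampling shortcut of equation (8.1) might be sketched as follows. The function name `se_mean_from_subsample`, the default of 50 units and the fixed seed are assumptions made for illustration, and the finite population correction is ignored; the point is only that the variance comes from the n' subsampled units while the standard error uses the original n.

```python
import math
import random
import statistics

def se_mean_from_subsample(sample, n_sub=50, seed=0):
    """Standard error of the full-sample mean, with s^2 taken from (8.1)."""
    rng = random.Random(seed)
    sub = rng.sample(sample, n_sub)        # random subsample of n' units
    s2 = statistics.variance(sub)          # divisor n' - 1, as in (8.1)
    return math.sqrt(s2 / len(sample))     # divide by the ORIGINAL size n

# Hypothetical data: 2,000 roughly normal values.
data_rng = random.Random(1)
data = [data_rng.gauss(100, 15) for _ in range(2000)]
print(se_mean_from_subsample(data))                    # shortcut, n' = 50
print(math.sqrt(statistics.variance(data) / 2000))     # full computation
```

With skewed data the two figures can differ noticeably, which is the reason the text asks for a much larger subsample in such cases.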
If the sample used is stratified with a proportional allocation of units, it may happen that strata variances are constant, viz.: $\sigma_h^2 = \sigma_w^2$. In such case the variance can easily be estimated by subtracting the between-strata sum of squares from the total sum of squares and applying the number of degrees of freedom which correspond to the within-strata mean square. 3 The same procedure will also give a satisfactory approximate value of the error if $\sigma_h^2$ were not constant, provided the sample remains allocated proportionally.
This procedure, of course, cannot be used if the variance of each stratum has to be estimated separately, as is the case with optimum allocation. In such a situation, the technique of grouping strata 4 may prove particularly useful if the number of strata is large. The technique consists in reducing the number of strata by merging them together and computing later the variances for such grouped strata only. The reduction of the amount of computation achieved then depends upon the reduction in the number of strata.
The principle to be followed in grouping strata would be to retain strata resulting from the most efficient criterion of stratification and neglect the others. For example, in a survey of population, the stratification might be performed (i) geographically; (ii) by size of villages; and (iii) by proportion of agricultural population in villages. If the geographic stratification is retained and the other two criteria disregarded, the number of strata might be reduced considerably. At the same time, the basic effects of stratification will be retained in the final variance if the geographic criterion is an efficient one.
In carrying out the computation of variances by this technique, all the data belonging to new and enlarged strata are grouped together and the variance is computed as if more detailed strata did not exist.
It will normally lead to an overestimation of the sampling errors because the contribution to the efficiency of stratification due to the ignored criteria is lost. If, in applying this technique, strata sample sizes are large they can also be subsampled for the purpose of variance computation.
3 Walter A. Hendricks. The mathematical theory of sampling, New Brunswick, N.J., Scarecrow Press, 1956, Chapter 4, and F. Yates. Sampling methods for censuses and surveys, 2nd ed., London, Griffin, 1953, Chapter 7.
4 Further details on this technique will be found in Morris H. Hansen, William N. Hurwitz and William G. Madow. Op. cit., Vol. 1, p. 438-9.
Several other ways of reducing the amount of computation are based on the use of squares of differences between individual data. If the population consists of 2 units only, with the values of a characteristic being $x_1$ and $x_2$, the sum of squares of deviations from the arithmetic mean can be expressed as

$$\left(x_1 - \frac{x_1 + x_2}{2}\right)^2 + \left(x_2 - \frac{x_1 + x_2}{2}\right)^2 = \frac{1}{2}(x_1 - x_2)^2$$

which makes it possible to express the variance of $x_1$ and $x_2$ as follows:

$$\sigma^2 = \frac{1}{4}(x_1 - x_2)^2$$

In contrast to the standard procedure in computing variances, which requires $\sum x_i$ and $\sum x_i^2$, here the differences only have to be squared. In many cases these differences are small numbers, so that variances can be computed easily without using machines.
A procedure based on these differences has been recommended in estimating variances of stratified samples. 5 To illustrate the point, it will be assumed that the population is divided into L strata with 2 units selected in the sample from each stratum, so that n = 2L. If the within-strata variances are equal, the within-strata sum of squares will be

$$\frac{1}{2}\sum_{h=1}^{L}(x_{h1} - x_{h2})^2$$

Since there are L differences $(x_{h1} - x_{h2})$, each of which contributes one degree of freedom, the estimate of $\sigma_w^2$ will be

$$s_w^2 = \frac{1}{2L}\sum_{h=1}^{L}(x_{h1} - x_{h2})^2 \tag{8.2}$$

Obviously, one should not lose sight of the limitations inherent in the use of (8.2). It has been assumed here that the within-strata variances are equal. This is not usually so. However, the equation (8.2) can be applied in such cases as well, provided strata sizes are equal.
5 Cf. F. Yates. Op. cit., p. 206.
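A minimal sketch of the estimate based on paired differences, assuming 2 sample units per stratum and equal within-strata variances as in (8.2). The function name `within_variance_from_pairs` and the four pairs of values are hypothetical. The comparison line shows that the shortcut equals the average of the ordinary within-stratum variances, each of which has a single degree of freedom.

```python
from statistics import variance

def within_variance_from_pairs(pairs):
    """Equation (8.2): squared within-stratum differences, summed, over 2L."""
    n_strata = len(pairs)
    return sum((a - b) ** 2 for a, b in pairs) / (2 * n_strata)

# Hypothetical sample: one (x_h1, x_h2) pair per stratum, L = 4.
pairs = [(12, 15), (30, 26), (8, 9), (21, 17)]
s2_w = within_variance_from_pairs(pairs)

# Conventional route: variance inside each stratum, then the average.
pooled = sum(variance(p) for p in pairs) / len(pairs)
print(s2_w, pooled)
```

Only four differences had to be squared, which is the saving the text has in mind when machines are not available.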
An adaptation of this technique might be very useful in surveys where the number of strata is large as well as the size of the sample selected from each stratum. The adaptation would consist in selecting a subsample of 2 units from each stratum and proceeding as above. The limitations in the use of this technique are obvious from the preceding paragraph.
The use of differences in defining and estimating variances of various sample designs has been extensively dealt with by Thionet. 6 Thionet has pointed out that in the case of data for 3 units, viz., $x_1$, $x_2$, $x_3$ with $\bar{x} = \frac{1}{3}(x_1 + x_2 + x_3)$, the sum of squares of deviations from the arithmetic mean can be expressed as follows:

$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 = \frac{1}{3}\left[(x_1 - x_2)^2 + (x_2 - x_3)^2 + (x_3 - x_1)^2\right]$$

leading to the variance

$$\sigma^2 = \frac{1}{9}\left[(x_1 - x_2)^2 + (x_2 - x_3)^2 + (x_3 - x_1)^2\right]$$

The variance of a population consisting of N units can also be defined in the same way. With $x_i$ and $x_j$ being two different numbers in the series of N values of a characteristic, the variance is defined as

$$\sigma^2 = \frac{1}{N^2}\sum (x_i - x_j)^2 \tag{8.3}$$

with the summation extending over $\frac{N(N-1)}{2}$ distinct or nonrepeated squares of differences $(x_i - x_j)^2$. If the average of these squares of differences is designated by $\Delta$, viz.:

$$\Delta = \frac{2\sum (x_i - x_j)^2}{N(N-1)}$$

one can also write

$$\sigma^2 = \frac{N-1}{N}\,\frac{\Delta}{2}, \qquad S^2 = \frac{\Delta}{2}$$

6 P. Thionet. Les pertes d'information en théorie des sondages, Paris, Institut national de la statistique et des études économiques, 1959. Études théoriques No. 7.
If a sample of n units is selected without replacement from the population of N units, an estimate of the variance (8.3) is obtained by means of $\Delta'$, which stands for an estimate of $\Delta$. An unbiased estimate of $\Delta$ is

$$\Delta' = \frac{2\sum (x_i - x_j)^2}{n(n-1)} \tag{8.4}$$

where the summation extends over $\frac{n(n-1)}{2}$ distinct differences in the sample. An estimate of the variance $S^2$ is now defined as

$$s^2 = \frac{\Delta'}{2}$$

with

$$E s^2 = S^2 = \frac{N}{N-1}\,\sigma^2$$
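The definitions in terms of differences can be verified numerically. In the sketch below the five population values and the function name `mean_squared_difference` are hypothetical; exact fractions are used so that the quantities can be compared with the usual formulas without rounding.

```python
from fractions import Fraction
from itertools import combinations
from statistics import pvariance, variance

def mean_squared_difference(values):
    """Average of (x_i - x_j)^2 over all distinct pairs of values."""
    pairs = list(combinations(values, 2))
    return Fraction(sum((a - b) ** 2 for a, b in pairs), len(pairs))

population = [3, 7, 8, 12, 20]               # hypothetical values, N = 5
N = len(population)
delta = mean_squared_difference(population)

sigma2 = Fraction(N - 1, N) * delta / 2      # variance with divisor N
S2 = delta / 2                               # variance with divisor N - 1

print(float(sigma2), pvariance(population))  # the two columns should agree
print(float(S2), variance(population))
```

Applied to the n units of a sample, the same function gives the quantity in (8.4), and half of it is the ordinary sample variance.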
The extension of these results to the case of stratified samples under the conditions and limitations discussed earlier in connection with the equation (8.2) is obvious. If the number of units selected from each stratum is not large, so that all the distinct differences and their values can be established easily, the use of differences will make it possible to obtain quick estimates of variances. In surveys with large strata samples, subsampling may be found to be practicable, as shown earlier.
Another result, based on a similar approach, is that of Keyfitz. 7 If the population is stratified into a large number of strata and 2 units from each stratum are selected in the sample with replacement, the variance of the sum of the values of characteristics for the 2 units in stratum h will be

$$\begin{aligned}
\sigma^2_{x_{h1} + x_{h2}} &= E\left[x_{h1} + x_{h2} - E(x_{h1} + x_{h2})\right]^2 \\
&= E\left[(x_{h1} - E x_{h1}) + (x_{h2} - E x_{h2})\right]^2 \\
&= E\left[(x_{h1} - E x_{h1}) - (x_{h2} - E x_{h2})\right]^2 \\
&= E\left[x_{h1} - x_{h2}\right]^2
\end{aligned} \tag{8.5}$$

The third line holds because the two selections are independent, so that the cross-product term has zero expectation whatever its sign; the last line holds because $E x_{h1} = E x_{h2}$.
7 N. Keyfitz. Estimates of sampling variance where two units are selected from each stratum, Journal of the American Statistical Association, Vol. 52, 1957, p. 503-510.
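The equality in (8.5) can be checked by enumerating every possible pair of selections in one stratum. The sketch assumes, for simplicity, that the two units are drawn with replacement and with equal probabilities; the stratum values and the function name `check_stratum` are hypothetical.

```python
from fractions import Fraction
from itertools import product

def check_stratum(values):
    """Return (variance of x1 + x2, expectation of (x1 - x2)^2)."""
    draws = list(product(values, repeat=2))   # all ordered pairs, equally likely
    m = len(draws)
    mean_sum = Fraction(sum(a + b for a, b in draws), m)
    var_sum = sum((a + b - mean_sum) ** 2 for a, b in draws) / m
    mean_sq_diff = Fraction(sum((a - b) ** 2 for a, b in draws), m)
    return var_sum, mean_sq_diff

var_sum, mean_sq_diff = check_stratum([4, 9, 11, 20])
print(var_sum, mean_sq_diff)
```

Both quantities come out equal, so a single squared difference per stratum is an unbiased estimate of that stratum's contribution to the variance of the sample total.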
The extension of the result on the sum of x values for all the units included in the sample follows at once. Strata samples are independent and so one has

$$\sigma^2_{\sum_h (x_{h1} + x_{h2})} = \sum_h E\left[x_{h1} - x_{h2}\right]^2 \tag{8.6}$$

This technique can be used in reducing the amount of computation with designs of various types.
Another useful technique in estimating the variance might be to divide the total size of the sample n into t random groups of k units each and afterward to obtain the quantities

$$x_g = \sum_{i=1}^{k} x_{gi} = \text{total of the x values in the g-th random group,}$$

with $x_{gi}$ being the value of the characteristic for the i-th unit in the g-th group. The next step is to compute the estimate

$$s^2 = \frac{\sum_{g=1}^{t} (x_g - \bar{x})^2}{k\,(t - 1)} \tag{8.7}$$

where $\bar{x}$ is the mean of the t group totals. This is an unbiased estimate of the population per unit variance. 8 In fact, in sampling with replacement $E s^2 = \sigma^2$ and in sampling without replacement $E s^2 = S^2$, where

$$S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$

8 Derivation of the formulas and proofs as well as further details on this method of estimating population variance will be found in Morris H. Hansen, William N. Hurwitz and William G. Madow. Op. cit., Vol. 1 and 2, Chapter 10.
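The method of random groups in (8.7) might be sketched as follows. The function name `random_group_variance`, the fixed seed and the data are assumptions for illustration; units left over after forming t groups of k units are simply dropped.

```python
import random
from statistics import mean, variance

def random_group_variance(sample, t, seed=0):
    """Equation (8.7): per-unit variance estimated from t random group totals."""
    rng = random.Random(seed)
    units = list(sample)
    rng.shuffle(units)                         # random assignment to groups
    k = len(units) // t                        # units per group; rest ignored
    totals = [sum(units[g * k:(g + 1) * k]) for g in range(t)]
    centre = mean(totals)                      # mean of the t group totals
    return sum((x - centre) ** 2 for x in totals) / (k * (t - 1))

# Hypothetical skewed data: 1,003 values, 20 groups of 50, 3 units left over.
data_rng = random.Random(2)
data = [data_rng.expovariate(0.1) for _ in range(1003)]
print(random_group_variance(data, t=20))       # based on 19 degrees of freedom
print(variance(data))                          # usual estimate, n - 1 = 1002
```

Only t totals per characteristic have to be tabulated and squared, at the price of fewer degrees of freedom than the usual estimate.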
If this method is to be used, the sample should be large enough so that a relatively large number of groups can be formed. Where t = 2, one would lose considerably as compared with the usual method of estimation based on (n - 1) degrees of freedom. On the other hand, the method of random groups might have some advantages when estimating variances from highly skewed distributions. Group totals will approach normal distribution as the size of groups increases. This fact is important from the point of view of the precision of the estimated variance. This may also be considered as some compensation for the losses mentioned.
In the case of many characteristics on the program, the method of random groups will make it possible to reduce the time needed for computing variances if an adequate tabulation plan is carried out. This plan requires the following steps to be taken successively: (i) division of the sample into t groups; (ii) tabulation of totals for each characteristic in all the groups; and (iii) application of equation (8.7) to the totals obtained in (ii). When the sample has been divided into t groups of k units each, some units will usually remain. These can be ignored in computing variances.
The same technique can also be adapted to more complex designs, such as multistage samples. The computation of variances with multistage designs will often be tedious if the information on errors is needed for study purposes, such as the approximation to an optimum design, where the components of variation need to be known. In census work it is sufficient to estimate the total error. This can be achieved much more easily by computing the variance of the estimated totals of primary units or sample totals if an equal overall sampling fraction is applied.
For illustration purposes, we shall assume a population of M primary units with $N_i$ secondary units in the i-th primary unit.
A sample of m primary units is selected from the population with replacement and with any probabilities defined in such a way that and
[Table (apparently part of Table 38): levels of variation by type-of-farm group. The column headings and the body of the table are not recoverable here. Surviving source note: ... Statistics by subjects, Washington, D.C., U.S. Government Printing Office, 1952.]
For this purpose the table has an entry for the level of variation found in Table 38 and the number of farms involved as given by the census report.
The foregoing example shows that several possibilities exist of finding a summarized presentation of sampling errors even in agricultural surveys and censuses. In the field of population censuses the preparation of a summarized table of errors has already become a routine owing to the fact that the situation is much simpler.
The intention in giving the above example was not to create an impression that something similar can easily be done in dealing with continuous variables. A glance at the last two tables shows clearly that a great deal of work was needed to prepare them. This involves, first, the finding of a satisfactory solution to the problem of an approximate evaluation of sampling errors from a design which may by no means be a simple one. It also means an actual computation of a large number of errors for a large enough number of areas. This is needed as a basis for determining the minimum number of regions to be used in the presentation of errors. Regions that are homogeneous from the point of view of the magnitude of sampling errors can only be established after actual errors for many census items are available by much smaller areas. Then, once such regions are established, the next step is to complete the picture of errors for all the items covered by various groups or classes of the population as in the headings of Table 38. At this stage, considerable work requiring time and abundant facilities may again be necessary before all the details are available for constructing tables similar to 37 and 38.
At the same time, the value of the work does not end with a single census. Much of it will be of lasting value, thanks to the stability of variation, in allowing the same results to be used in designing sample surveys and assessing the magnitude of their sampling errors.
9. CENSUSES AND SUBSEQUENT SURVEY WORK
Changes in time of some characteristics

In almost all censuses there will be some items which are characterized by a relative stability. An illustration is presented in Table 39, which shows the total area of agricultural holdings in Belgium, classified into owned and rented land, as obtained in a number of successive censuses of agriculture taken over the last hundred years. It can be seen that, in spite of some variation, the percentage distribution is basically stable, although the period involved is long.

TABLE 39. - OWNED AND RENTED LAND IN AGRICULTURAL HOLDINGS IN SUCCESSIVE CENSUSES OF AGRICULTURE IN BELGIUM ¹

                         Land in farms
Year    Owned                    Rented                    Total
        Hectares   Percentage    Hectares     Percentage   Hectares
1846    613 571    34.22         1 179 583    65.78        1 793 154
1856    628 292    34.32         1 202 225    65.68        1 830 517
1866    642 721    32.68         1 323 958    67.32        1 966 679
1880    713 059    35.95         1 270 511    64.05        1 983 570
1895    596 331    31.11         1 320 359    68.89        1 916 690
1910    555 270    28.38         1 401 212    71.62        1 956 482
1929    726 834    38.12         1 179 764    61.88        1 906 598
1950    604 968    33.31         1 211 335    66.69        1 816 303

¹ Data presented in this table are taken from: Recensement général de l'agriculture de 1950, Tome I. Bruxelles, Institut national de statistique, 1953, p. 45.
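The percentage columns of Table 39 follow directly from the hectare figures, and the stability referred to in the text can be checked by recomputing them. The following is a minimal sketch; the names `owned`, `rented`, `owned_share` and `spread` are chosen here for illustration and do not come from the census report.

```python
# Hectares of owned and rented land, transcribed from Table 39.
owned = {1846: 613_571, 1856: 628_292, 1866: 642_721, 1880: 713_059,
         1895: 596_331, 1910: 555_270, 1929: 726_834, 1950: 604_968}
rented = {1846: 1_179_583, 1856: 1_202_225, 1866: 1_323_958, 1880: 1_270_511,
          1895: 1_320_359, 1910: 1_401_212, 1929: 1_179_764, 1950: 1_211_335}


def owned_share(year):
    """Owned land as a percentage of the total holding area in a census year."""
    total = owned[year] + rented[year]
    return round(100 * owned[year] / total, 2)


# Percentage of owned land in each census year.
shares = {year: owned_share(year) for year in owned}

# Difference between the highest and the lowest share over the whole period.
spread = round(max(shares.values()) - min(shares.values()), 2)

for year, share in shares.items():
    print(year, share)
print("spread in percentage points:", spread)
```

Over a period of about a hundred years the share of owned land stays within a band of roughly ten percentage points, which is the sense in which the distribution is described as basically stable.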
In censuses of population, such items are age and sex distribution of the population, distribution by religious affiliation and color, fertility and mortality characteristics, etc. With regard to items of this type, the census results truly represent the state of the characteristics concerned, even if available to the public
after a delay of several years from the initial enumeration. In the same way, such results are equally adequate for practical purposes, as a basis for policy decisions and economic measures.

On the other hand, many other characteristics change more considerably from one year to another, from season to season. Two types of such changes will be distinguished here, viz., year-to-year changes and seasonal or within-year changes. An illustration of year-to-year changes is presented in Table 40. It is seen from this table that the variations from one year to another are quite considerable; the information collected in any given year cannot be taken as a reliable basis for practical measures and various decisions in subsequent years.
TABLE 40. - AVERAGE YIELD OF WHEAT PER HECTARE IN YUGOSLAVIA ¹

Year    Quintals per hectare
1957    15.8
1958    12.3
1959    19.4
1960    17.3
1961    16.J
1962    16.5
1963    19.3

¹ Data in this table are taken from an unpublished paper by M. Petrovich and A. Stanoievich, The reliability of data on production of wheat, Belgrade, 1964.
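The size of these year-to-year movements is easier to judge when expressed as percentage changes from the preceding year. The sketch below does this for the first four years of Table 40; the function name `year_to_year_changes` is introduced here only for illustration.

```python
# Average wheat yield (quintals per hectare), first four years of Table 40.
yields = [(1957, 15.8), (1958, 12.3), (1959, 19.4), (1960, 17.3)]


def year_to_year_changes(series):
    """Return (year, percentage change from the previous year) pairs."""
    changes = []
    for (_, previous), (year, current) in zip(series, series[1:]):
        changes.append((year, round(100 * (current - previous) / previous, 1)))
    return changes


changes = year_to_year_changes(yields)
for year, change in changes:
    print(year, change)
```

A fall of about a fifth in one year followed by a rise of more than a half in the next is the kind of irregular movement that makes a single year's figure a poor guide to later years.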
An illustration of seasonal variations will be found in Table 41, which presents the estimates of the labor force employed in agriculture in the United States during 1955 and 1956. In this table the total employment is given first and next the number of males and females employed, because the variations are different if the total is subdivided by sex. Changes of each series of absolute numbers are expressed in a relative form by taking the January figure of each year as 100. Obviously, census data obtained at any time for items subject to seasonal changes of this type are much less useful than in the case of stable phenomena. If the census is taken during the winter, figures obtained for agricultural employment will show approximately the minimum, and the use of such information will thus be limited in many respects. The same holds true if the census is taken at any other period of the year.

TABLE 41. - TOTAL NUMBER OF PERSONS EMPLOYED IN AGRICULTURE IN THE UNITED STATES DURING THE PERIOD 1955/56 ¹ (14 years of age and over)

Date             Total                Males                Females
                 Thousand   Index     Thousand   Index     Thousand   Index
1955
 8 January       5 297      100       4 753      100         544      100
12 February      5 084       96       4 621       97         464       85
12 March         5 692      107       5 023      106         669      123
 9 April         6 215      117       5 287      111         927      170
14 May           6 963      131       5 622      118       1 342      247
11 June          7 681      145       5 982      126       1 700      312
16 July          7 704      145       6 075      128       1 629      299
13 August        7 536      142       5 980      126       1 556      286
17 September     7 875      149       5 971      126       1 904      350
15 October       7 905      149       5 942      125       1 962      361
17 November      6 920      131       5 585      118       1 336      246
10 December      5 884      111       5 000      105         884      162
1956
14 January       5 635      100       4 892      100         743      100
18 February      5 469       97       4 766       97         703       95
17 March         5 678      101       4 867       99         811      109
14 April         6 387      113       5 348      109       1 039      140
12 May           7 146      127       5 562      114       1 584      213
16 June          7 876      140       6 013      123       1 863      251
14 July          7 700      137       5 926      121       1 775      239
18 August        7 265      129       5 676      116       1 589      214
15 September     7 388      131       5 490      112       1 898      255
13 October       7 173      127       5 419      111       1 754      236
17 November      6 192      110       5 022      103       1 171      158
15 December      5 110       91       4 358       89         752      101

Index: January of each year = 100.

¹ SOURCE: Current population reports: labor force, Washington, D.C., U.S. Bureau of the Census, 1957. Series P-57, No. 174.
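The relative figures of Table 41 are obtained by dividing each monthly estimate by the January estimate of the same year and multiplying by 100. A minimal sketch for the 1955 series of totals and of females follows; the function name `index_on_january` is introduced here for illustration only.

```python
# Monthly estimates for 1955 in thousands, January to December, from Table 41.
total_1955 = [5297, 5084, 5692, 6215, 6963, 7681,
              7704, 7536, 7875, 7905, 6920, 5884]
females_1955 = [544, 464, 669, 927, 1342, 1700,
                1629, 1556, 1904, 1962, 1336, 884]


def index_on_january(series):
    """Express each monthly value relative to the first (January) value = 100."""
    base = series[0]
    return [round(100 * value / base) for value in series]


total_index = index_on_january(total_1955)
females_index = index_on_january(females_1955)

print(total_index)
print(max(total_index), max(females_index))
```

The peak of the index for females is far higher than that for the total, which is why the table shows the two sexes separately.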
Change surveys

In securing information on changes, two different approaches could be used. The first consists in the establishment of current statistics. At some stage of statistical development the establishment of current statistics is necessary for all the characteristics which change according to a more or less irregular pattern, so that information available for
a base period or another point of time cannot be used to make reliable predictions of the state of these characteristics at some other time. To illustrate, one might refer to Table 40. In 1958, the average yield was 12.3 quintals per hectare. This figure indicates very little about the yield in the next year, when it jumped to 19.4 quintals. In situations of this type, which are very frequent in the social and economic life of each country, current statistics provide fresh data on changes. Needless to say, the number of characteristics included in the program of current statistics and the frequency of points of time to which data refer depend upon the needs for data and vary from one country to another. For example, in some countries labor force surveys take the form of monthly surveys, while in other cases they are taken once in three or six months.

The alternative approach consists in taking change surveys. This term is used in this study to designate the type of occasional surveys taken with a view to estimating a change that took place in some characteristics in a period of time starting with a point considered as a base. Such a base point might be a census. For example, the change survey may be intended to provide an estimate of the change in the number of livestock or employment between the census and some other point.

Current statistics is a systematic approach to collecting data on changes, as fresh data are supplied according to some fixed time schedule. In contrast to this, change surveys are visualized as one-time or non-repetitive surveys which are taken occasionally at particularly opportune moments. As such, change surveys do not represent a suitable approach to collecting data on characteristics where the information obtained in one year does not have much value in subsequent years. Examples of such characteristics are yields, areas, livestock in some countries, etc.
Change surveys are useful in dealing with less drastic changes, so that information collected in one year may also be utilized for many purposes in subsequent years as well. A case in point might be the seasonal variations in the agricultural labor force. If a census of agriculture is taken in winter when agricultural employment is at its minimum, one might wish to take a change survey during the following summer at the peak of agricultural activities in order to have a general idea of the difference. The information obtained would then be used in subsequent years without repeating the survey until another convenient opportunity arises after a number of years. In a similar way, the census of retail shops might be used as a basis for the estimation of variations in the volume and structure of sales.
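A rough numerical illustration of this use of a change survey can be built from Table 41. In the sketch below the January 1955 figure stands in for a winter census and the July 1955 figure for a summer change survey; this pairing, and all the names used, are assumptions made here for illustration and are not part of the original study. The summer-to-winter ratio is then applied to the January 1956 figure and compared with the July 1956 estimate.

```python
# Total agricultural employment in thousands, taken from Table 41.
# Treating these figures as a "census" and a "change survey" is an
# assumption made only for the purpose of this illustration.
WINTER_1955 = 5297  # 8 January 1955, standing in for a winter census
SUMMER_1955 = 7704  # 16 July 1955, standing in for a summer change survey
WINTER_1956 = 5635  # 14 January 1956, a later winter figure
SUMMER_1956 = 7700  # 14 July 1956, used only to check the projection


def seasonal_ratio(summer, winter):
    """Ratio of the summer level to the winter level in the base year."""
    return summer / winter


def project_summer(later_winter, ratio):
    """Apply the base-year ratio to a winter figure of a later year."""
    return later_winter * ratio


ratio = seasonal_ratio(SUMMER_1955, WINTER_1955)
projected = project_summer(WINTER_1956, ratio)
relative_error = 100 * (projected - SUMMER_1956) / SUMMER_1956

print(round(ratio, 3), round(projected), round(relative_error, 1))
```

With these figures the projection exceeds the July 1956 estimate by roughly 6 percent, which agrees with the text's description of the result as a general idea of the difference and not a substitute for current statistics.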
Change surveys may also be used sometimes as a replacement for some branches of current statistics in countries which cannot provide for the systematic collection of data on changes. For example, if a country cannot afford current sample surveys of the labor force, one or two surveys taken in the same year as the census of population would give an idea of changes in the period covered. Such surveys show the maximum and minimum state of employment, the relationships between various characteristics at distinct periods of the year, differences in the number of hours worked at the two periods, differences in salaries, etc.

There are several reasons for combining change surveys with censuses dealing with the same type of items. Firstly, such a combination is useful from the organizational point of view. If a census is to be taken, a law is generally passed beforehand by which the giving of statistical information is made obligatory. A similar law is not usual in connection with sample surveys taken only from time to time. For this reason, any new statistical activity based on the census or connected with it is likely to benefit by the existence of the law as regards all those factors that might be affected by the favorable atmosphere it creates. Also, a large number of enumerators are always trained for the census and this might be a good opportunity to select the best of them for work on change surveys. Besides, many other preparations are also undertaken for the census, such as the division of the country into enumeration districts and the preparation of maps and sketches. If such facilities are available, it is advisable to use them because they can greatly improve the quality of the survey and make the work easier as a whole. The number of facilities available in the census is usually greater than one can expect in independent sample surveys.
Consequently, only a relatively small additional effort is needed to supplement census data with information on changes if the latter operation is combined with the census.

A second consideration is the increase in efficiency which may result if the census figures are used as supplementary information in the change surveys. If data on changes are needed for items included in the census program, it is likely that the correlation between census figures for these items and the figures obtained at a later date for the same units as before will be high. The gain in efficiency may then be such as would make the combination of these surveys with censuses worthwhile. If some items in which we are interested are not included in the complete enumeration census program, it might again be possible to use a correlated variable from the census program for stratification purposes, in the
process of estimation, or in the selection of units with varying probabilities.

An illustration of what can thus be gained in efficiency will be taken from livestock statistics in the Federal Republic of Germany. A regular census was taken in December 1950 and a sample survey in March 1951. The aim of the survey was to estimate changes in the number and various distributions of pigs. The sample consisted of 186 communes selected from a population of 984, with all the holdings enumerated in the selected communes. The results obtained are shown in Table 42. Column 1 gives the percentage coefficient of variation of the estimated totals obtained from the March survey independent of the census data. Column 2 presents the same by using the ratio method of estimation with census data as supplementary information. Column 3 shows the increase in efficiency due to the use of census data. ¹

TABLE 42. - ILLUSTRATION OF GAIN IN EFFICIENCY BY USING CENSUS DATA AS SUPPLEMENTARY INFORMATION ¹

Characteristics    Percentage errors      100 × (1)/(2)
                   1          2           3
TOTAL PIGS         4.0        0.9         444
Piglets            4.9        2.6         188
Young pigs         3.9        1.4         279
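The comparison made in Table 42 can be sketched in code. The example below is only an illustration under stated assumptions: the communes are treated as a simple random sample drawn without replacement (the selection method is not described here), the variance formulas are the usual textbook approximations for the expansion and ratio estimators, and the six commune figures and the census total `X_TOTAL` are invented for the purpose. Only `N = 984` and the coefficients of variation passed to `efficiency_index` come from the text and Table 42.

```python
import math


def expansion_estimate(y, N):
    """Estimate a total from survey values alone; return (total, CV in percent)."""
    n = len(y)
    mean_y = sum(y) / n
    s2 = sum((v - mean_y) ** 2 for v in y) / (n - 1)
    total = N * mean_y
    variance = N ** 2 * (1 - n / N) * s2 / n
    return total, 100 * math.sqrt(variance) / total


def ratio_estimate(y, x, X_total, N):
    """Ratio estimate using census values x as supplementary information."""
    n = len(y)
    r = sum(y) / sum(x)  # survey total over census total in the sample
    total = r * X_total
    # Approximate variance based on residuals from the fitted ratio.
    s2_resid = sum((yi - r * xi) ** 2 for yi, xi in zip(y, x)) / (n - 1)
    variance = N ** 2 * (1 - n / N) * s2_resid / n
    return total, 100 * math.sqrt(variance) / total


def efficiency_index(cv_without, cv_with):
    """Column 3 of Table 42: 100 x (column 1) / (column 2)."""
    return round(100 * cv_without / cv_with)


N = 984                                     # communes in the population (from the text)
census = [120, 340, 80, 510, 260, 190]      # hypothetical December counts of pigs
survey = [131, 362, 90, 548, 283, 201]      # hypothetical March counts, same communes
X_TOTAL = 240_000                           # hypothetical census total for all communes

total_exp, cv_exp = expansion_estimate(survey, N)
total_ratio, cv_ratio = ratio_estimate(survey, census, X_TOTAL, N)
print(round(cv_exp, 1), round(cv_ratio, 1))

# Reproducing column 3 of Table 42 from columns 1 and 2.
print(efficiency_index(4.0, 0.9), efficiency_index(4.9, 2.6), efficiency_index(3.9, 1.4))
```

Because the March counts follow the December counts closely from commune to commune, the residuals from the fitted ratio are small and the ratio estimate has a much smaller coefficient of variation than the estimate that ignores the census, which is the effect summarized in column 3.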
[Facsimile of a census questionnaire (Part II). Legible items include: "(b) If no one rents the land or uses the land for livestock, enter the name of the owner of the land"; "Does this person or any member of his household operate a farm (or ranch)?"; and whether the household has any hogs, cattle, or sheep, or any corn, oats, hay, or tobacco.]