Methodological Artefacts, Data Manipulation and Fraud in Economics and Social Science: Themenheft 5+6/Bd. 231 (2011), Jahrbücher für Nationalökonomie und Statistik. ISBN 9783110508420, 9783828205574



English · 208 pages · 2011




Methodological Artefacts, Data Manipulation and Fraud in Economics and Social Science

Edited by Andreas Diekmann

With Contributions by
Arminger, Gerhard, Wuppertal · Auspurg, Katrin, Konstanz · Bauer, Johannes, Munich · Coutts†, Elisabeth, Zurich · Diekmann, Andreas, Zurich · Franzen, Axel, Bern · Greiser, Eberhard, Musweiler · Gross, Jochen, Munich · Hinz, Thomas, Konstanz · Jann, Ben, Bern · Krämer, Walter, Dortmund · Krumpal, Ivar, Leipzig · Mack, Verena, Konstanz · Näher, Anatol-Fiete, Leipzig · Opp, Karl-Dieter, Leipzig · Schräpler, Jörg-Peter, Bochum · Shikano, Susumu, Konstanz · Vogl, Dominikus, Bern · Wagner, Michael, Cologne · Weiß, Bernd, Cologne

Lucius & Lucius · Stuttgart 2011

Address of the editor of this special issue: Prof. Dr. Andreas Diekmann, ETH Zürich - Soziologie, CLU D 3, Clausiusstrasse 50, 8092 Zürich, Switzerland. [email protected]

Bibliographic information of the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the internet at http://dnb.d-nb.de. ISBN 978-3-8282-0557-4

© Lucius & Lucius Verlagsgesellschaft mbH · Stuttgart -2011 Gerokstraße 51, D-79184 Stuttgart Das Werk einschließlich aller seiner Teile ist urheberrechtlich geschützt. Jede Verwertung außerhalb der engen Grenzen des Urheberrechtsgesetzes ist ohne Zustimmung des Verlags unzulässig und strafbar. Das gilt insbesondere für Vervielfältigungen, Übersetzungen und Mikroverfilmungen sowie die Einspeicherung und Verarbeitung in elektronischen Systemen.

Typesetting: Mitterweger & Partner Kommunikationsgesellschaft mbH, Plankstadt. Printing and binding: Neumann Druck, Heidelberg. Printed in Germany.


Inhalt / Contents

Guest Editorial 596-597

Abhandlungen / Original Papers

Opp, Karl-Dieter, The Production of Historical "Facts": How the Wrong Number of Participants in the Leipzig Monday Demonstration on October 9, 1989 Became a Convention 598-607

Krämer, Walter, Gerhard Arminger, "True Believers" or Numerical Terrorism at the Nuclear Power Plant 608-620

Greiser, Eberhard, One-eyed Epidemiologic Dummies at Nuclear Power Plants. A Reply to Walter Krämer and Gerhard Arminger's Paper "'True Believers' or Numerical Terrorism at the Nuclear Power Plant" 621-627

Diekmann, Andreas, Are Most Published Research Findings False? 628-635

Auspurg, Katrin, Thomas Hinz, What Fuels Publication Bias? Theoretical and Empirical Analyses of Risk Factors Using the Caliper Test 636-660

Weiß, Bernd, Michael Wagner, The Identification and Prevention of Publication Bias in the Social Sciences and Economics 661-684

Schräpler, Jörg-Peter, Benford's Law as an Instrument for Fraud Detection in Surveys Using the Data of the Socio-Economic Panel (SOEP) 685-718

Shikano, Susumu, Verena Mack, When Does the Second-Digit Benford's Law-Test Signal an Election Fraud? Facts or Misleading Test Results 719-732

Bauer, Johannes, Jochen Gross, Difficulties Detecting Fraud? The Use of Benford's Law on Regression Tables 733-748

Coutts†, Elisabeth, Ben Jann, Ivar Krumpal, Anatol-Fiete Näher, Plagiarism in Student Papers: Prevalence Estimates Using Special Techniques for Sensitive Questions 749-760

Franzen, Axel, Dominikus Vogl, Pitfalls of International Comparative Research: Taking Acquiescence into Account 761-782

Buchbesprechungen / Book Reviews

Aoyama, H., Y. Fujiwara, Y. Ikeda, H. Iyetomi, W. Souma, Econophysics and Companies. Statistical Life and Death in Complex Business Networks 783
Postler, Andreas, Nachhaltige Finanzierung der Gesetzlichen Krankenversicherung 784
Ramser, Hans J., Manfred Stadler (Hrsg.), Marktmacht 785

Bandinhalt des 231. Jahrgangs der Zeitschrift für Nationalökonomie und Statistik / Contents of Volume 231 of the Journal of Economics and Statistics


Guest Editorial

Falsified interviews and data threaten the validity of empirical social science, as do unintentional and systematic errors in survey design, sampling, econometric analysis, and experimental lab research. While problems of data falsification are supposedly less common than in biomedical publications, they may still be significant enough to threaten the integrity of, and the trust placed in, scientific research. New diagnostic techniques and replications of experimental, econometric and survey research can improve data quality and identify errors in economics and social science publications. This issue comprises 11 papers - 10 articles and one comment - focussing on statistical artefacts, the diagnosis of errors and falsification, and suggestions for improving the validity of reported scientific results. The articles have been peer reviewed, most of them anonymously, including the editor's contribution.

The first article, by Karl-Dieter Opp, is a case study of how an estimate of the number of participants in the famous Leipzig protest movement became a social fact. The analysis of this historical example is instructive and reminds us of the need to be more cautious in accepting social facts before starting to theorize, build models and run econometric analyses.

Two articles, by Walter Krämer and Gerhard Arminger and by Andreas Diekmann, raise the question of misleading results produced by routine practices of significance testing. Krämer and Arminger's provocative article is commented on by Eberhard Greiser, who defends his survey and analysis of the associations between living near nuclear power plants and leukemia. Energy giant RWE ordered 1,000 preprints of the "True Believer" article but not the comment by Eberhard Greiser.¹ Ironically, after the Fukushima catastrophe, the preprints are worthless for propaganda use anyway. Here, it is not the 'pros' and 'cons' of nuclear energy that are at stake but rather the use, misuse and interpretation of significance testing.

A serious problem with published research and meta-analysis is the preference of authors and editors for significant results. If reported findings in professional journals are highly selective in favour of significant results, meta-analyses based on published articles become distorted and misleading. Both the article by Katrin Auspurg and Thomas Hinz and that by Bernd Weiß and Michael Wagner focus on publication bias and methods to detect and combat such bias. Weiß and Wagner emphasize meta-analysis and funnel plots, while Auspurg and Hinz demonstrate how to identify publication bias with the so-called "caliper test". An example from the history of statistics is astronomer and sociologist Adolphe Quetelet's analysis of the distribution of the heights of young French males: there was a surplus just below 157 cm and a gap immediately beyond, because young men smaller than 157 cm were not drafted by the military. The caliper test employs the same logic through a close inspection of the distribution of statistics from tests of significance sampled from professional journals. For example, a gap in the distribution of z-values just below the threshold of z = 1.96 serves as an indicator of publication bias or other forms of data manipulation. A minimal sketch of this logic is given below.
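To make the caliper logic concrete, here is a minimal sketch in Python. It is purely illustrative: the function name, the caliper width of 0.20, and the ten z-statistics are invented for the example and are not taken from any of the articles in this issue.

```python
# Minimal caliper-test sketch: absent selection, z-statistics should fall
# just below and just above the 1.96 threshold about equally often.
from scipy.stats import binomtest

def caliper_test(z_values, threshold=1.96, width=0.20):
    below = sum(threshold - width < z <= threshold for z in z_values)
    above = sum(threshold < z <= threshold + width for z in z_values)
    # A surplus just above the threshold signals publication bias.
    return below, above, binomtest(above, below + above, p=0.5).pvalue

# Invented z-statistics, as if harvested from journal articles:
z_sample = [1.79, 1.88, 1.97, 1.99, 2.01, 2.03, 2.05, 2.08, 2.10, 2.12]
print(caliper_test(z_sample))   # (2, 8, ...): many more just above 1.96
```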
Another creative idea is the use of Benford tests for identifying fraudulent manipulations of data. Similar to caliper tests, the method exploits the fact that the digits of numerical research findings often follow a logarithmic Benford distribution; deviations from "Benford's law" may then suggest fraudulent data. Jörg-Peter Schräpler's article, Susumu Shikano and Verena Mack's article, and the article by Johannes Bauer and Jochen Gross apply the Benford method to interviewer fraud in survey research, election fraud, and fraudulent manipulation of regression tables, respectively. However, the efficacy of this method is still a matter of controversy. A sketch of such a digit test follows.
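Benford's law puts the probability of leading digit d at log10(1 + 1/d); a chi-square statistic then measures the deviation of observed digit frequencies from this benchmark. The helper below is a hypothetical illustration, not the procedure of any of the three articles:

```python
# Illustrative first-digit Benford test via a chi-square statistic.
import math
from collections import Counter
from scipy.stats import chi2

def benford_chi2(numbers):
    # Leading (non-zero) digit of each number.
    digits = [int(str(abs(x)).lstrip("0.")[0]) for x in numbers if x != 0]
    n = len(digits)
    observed = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford probability times n
        stat += (observed.get(d, 0) - expected) ** 2 / expected
    return stat, chi2.sf(stat, df=8)  # 9 digit classes -> 8 degrees of freedom

stat, p = benford_chi2([846, 1290, 17.4, 3021, 58, 112, 9.9, 204, 1780, 46])
print(stat, p)  # a small p-value would flag deviation from Benford's law
```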

¹ Walter Krämer assured us that the research reported in this issue was not supported by any interested commercial organization.


Data validity is an important topic in survey research. The articles by Coutts et al. and by Franzen and Vogl focus on this issue. Elisabeth Coutts, Ben Jann, Ivar Krumpal, and Anatol-Fiete Näher report results from a survey on plagiarism among students. Their main interest is testing and comparing various techniques for improving the validity of answers to sensitive questions. These methods may be of importance for collecting data on self-reported scientific misconduct. Axel Franzen and Dominikus Vogl challenge Riley Dunlap and Richard York's finding that the populations of poor countries show an even higher degree of environmental concern than the populations of rich countries. They argue that part of the correlation is due to a systematic bias from intercultural variations in response patterns. International survey programs such as the World Value Survey, the European Value Survey, the International Social Survey Program and others are frequently used sources of data for cross-national comparisons and econometric analysis. It is therefore important to place more emphasis on problems in the data collection process of international surveys and, likewise, to evaluate the validity and comparability of the data more carefully.

This special issue deals with a few selected aspects of a broad range of topics concerning the diagnosis of data manipulation, systematic statistical error and data validity. Of course, the debate must be continued. Hopefully, critical investigations such as those presented in this issue will contribute to the improvement of social science methodology.

Andreas Diekmann


Abhandlungen / Original Papers

The Production of Historical "Facts": How the Wrong Number of Participants in the Leipzig Monday Demonstration on October 9, 1989 Became a Convention

By Karl-Dieter Opp, Leipzig*

JEL C8
Keywords: Leipzig demonstrations; Leipzig demonstration on October 9, 1989; East German Revolution; historical "facts"; spread of false beliefs; survey research; reliability of official data; faked data; negligent data handling.

Summary

This paper deals with the demonstration in Leipzig on October 9, 1989, an important episode in the history of the East German Revolution. It is generally held that 70,000 demonstrators participated. This paper shows that this number is clearly wrong. The paper briefly describes the results of a survey that were inconsistent with this number and how the authors of the study proceeded to make a new estimate. The paper further outlines how the original estimate was made and found its way into the media and historical accounts. Finally, some general lessons are drawn from the case. The case study this paper focuses on is not an example of the faking of data, but rather of negligent data handling. However, it is argued that the lessons from this case discussed in the final section hold for faked data as well.

Introduction

When did the Egyptian pharaoh Tutankhamun live? Approximately 1341 BC to 1323 BC. When did the Peloponnesian war take place? It lasted from 431 BC to 404 BC. How many protesters participated in the Monday demonstration in Leipzig on October 9, 1989, widely understood as the breakthrough event in the East German Revolution? The unanimous answer is 70,000 (it is even rare that "approximately" is added). These dates and figures and many others have become established facts - they are reported in history books, on the internet, and in encyclopedias. They have become common knowledge.

To what extent can we trust these data? Most of those who share this common knowledge assume that historians and other social scientists have carefully explored all the available evidence to make sure that the reported data are correct. Otherwise, reports would be qualified. But has there really been some effort to validate all these data?

* I wish to thank Steven Pfaff (University of Washington, Department of Sociology, Seattle) for valuable comments on an earlier version of this paper. I further thank Andreas Diekmann (ETH Zürich) and an anonymous reviewer for helpful suggestions.


If not, is this reported alongside the data? Even if historians and social scientists tried hard to find evidence that confirms the data, couldn't they have made mistakes? If data are problematic, it would seem most likely that the hazard is greatest with regard to data from the distant past, as there will be fewer sources and it is more difficult to assess the validity of those sources. One would thus expect that information reported about events of the recent past would be more reliable. Why would one doubt that, for example, the number of participants in the Leipzig Monday demonstration on October 9, 1989 was actually 70,000?

This paper will describe in detail an example of a historical "fact" that has become common knowledge but that is clearly wrong. That the size of the Monday demonstration in Leipzig on October 9, 1989 was 70,000 persons is reported everywhere.¹ How can a wrong number become generally accepted if it refers to an important event in recent German history? The demonstration is important because it set the stage for the collapse of the communist regime in East Germany (see below).² What is still more surprising is that it has been known since 1993 that the 70,000 number is wrong (Opp et al. 1993: 47). Nonetheless, it has remained generally accepted.

To facilitate understanding of the remainder of this paper, it seems useful first to describe briefly the historical context, i.e. the situation in the German Democratic Republic (GDR, i.e. East Germany) in 1989. The protests in the GDR in 1989 were surprising, because the GDR was one of those communist states that seemed stable due to an extensive system of repression. After the uprising on June 17, 1953, there had been little protest. The situation began to change in May 1989, when members of the opposition movement found that the published results of the communal election had been tampered with. This led to a series of protests that increased over the course of 1989. The largest demonstration in the history of the GDR after the 1953 uprising took place on October 9, 1989. More than 70,000 citizens - this is the number that has become generally accepted - gathered in the center of Leipzig at Karl Marx Square (Karl-Marx-Platz, now Augustus Square - Augustusplatz), even though it was likely that this demonstration would be crushed. However, the demonstration remained peaceful on the side of the demonstrators as well as on the side of the police forces. After this event, protests in Leipzig - the well-known Monday demonstrations - and at other places in the GDR increased and finally led to the collapse of the communist regime. It is not overstating matters to claim that this demonstration was a decisive step toward the breakdown of the communist regime and the unification of the two German states. The first free elections were held on March 18, 1990. German unification occurred on October 3, 1990.³

¹ See, for example, the website for the twentieth anniversary of the demonstration of October 9, http://www.siebzigtausend-in-leipzig.de/ ("seventy-thousand-in-leipzig.de"). This website is under the patronage of the Leipzig mayor Burkhard Jung. It begins: "On October 9, 2009 the Monday demonstration in which 70,000 participated has its twentieth anniversary." ("Am 9. Oktober 2009 jährt sich die Montagsdemonstration, an der 70.000 teilnahmen, zum zwanzigsten Mal.") The number is thus regarded as an established fact.
² This demonstration is also unique for another reason: even if we assume that the 70,000 estimate is correct, this is the highest mobilization rate in urban rebellions in history. For details, see Pfaff 2006: 284-285.
³ There is an extensive literature on the East German revolution. In English see, for example, Bartee 2000; Lohmann 1993, 1994; Opp et al. 1995; Opp 1993, 1994; Opp/Gern 1993; Pfaff 2006.


The nudge to examine the 70,000 number

The number of 70,000 participants was reported in the media shortly after the demonstration on October 9, 1989, and quickly became universally accepted. Newspapers first reported it, and it was then taken over by historians, social scientists and members of political parties and other organizations, including museums like the "Zeitgeschichtliches Forum" (forum for contemporary history) in Leipzig. Furthermore, pictures of the demonstration at Karl Marx Square were not inconsistent with such a number: they show a big crowd, densely packed. So there seemed to be no reason to become suspicious about the 70,000 number.

However, there was in fact a good reason to doubt the reported number, and it lay in the unexpected findings of a research project funded in 1990 by the German National Science Foundation (Deutsche Forschungsgemeinschaft). The goal of this project was to explain the peaceful revolution in East Germany in 1989/1990. It included a representative survey of the Leipzig population, administered in the fall of 1990.⁴ Most of the questionnaire referred to the situation in the fall of 1989. Among other things, the respondents were asked whether they had participated in one of the Monday demonstrations and, if so, in which one. Among the 1,300 respondents, 1,225 gave a valid answer; 320 respondents said they had participated in the demonstration of October 9. These respondents make up 26 % of the sample.

Yet this finding was inconsistent with the conventional estimate of the size of the protest. At the end of 1989, Leipzig had a population of 530,010, according to the GDR yearbook of statistics. 70,000 would be 13.21 % of this population - half of the rate in our survey. But even 13.21 % is not correct, because the 70,000 would also include some participants who were not residents of Leipzig. Participants also came from outside, but nobody knows how large this number was. Thus, there were certainly fewer than 70,000 residents of Leipzig who participated. So, let us assume there were 60,000 participants from Leipzig itself. This yields a percentage of 11.32 % (60,000 × 100/530,010). An estimate of 13.21 % is thus too high. A further correction must be made: perhaps we should take into account only those residents aged 15 years and older, a total of 440,156. When we take this as the basis for our computation, the percentage would be 70,000 × 100/440,156 = 15.9 %. This is still clearly lower than the rate reported in our survey. In any case, there is a clear and substantial difference between the number of participants extrapolated from our representative survey and the unanimously accepted "fact" of 70,000. How can this difference be explained?
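As an illustration (not part of the original study), the back-of-the-envelope comparisons above can be retraced in a few lines of Python; the figures are exactly those given in the text:

```python
# Implied participation rates under the conventional 70,000 estimate.
population = 530_010       # Leipzig, end of 1989 (GDR yearbook of statistics)
aged_15_plus = 440_156     # residents aged 15 and older

print(70_000 / population * 100)     # 13.21 % of the total population
print(60_000 / population * 100)     # 11.32 % if only 60,000 were residents
print(70_000 / aged_15_plus * 100)   # 15.9  % of residents aged 15+
# Each figure is far below the 26 % participation rate found in the survey.
```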

⁴ The research in the fall of 1990 was only the first wave of a panel with three further waves in 1993, 1996 and 1998. For the first wave, see Opp et al. 1993 (English 1995). There are numerous publications about this project. See also Opp 1993, 1994; Opp/Gern 1993; and most recently Opp/Kittel 2010 and Opp/Brandstätter 2010, which include further references. A book in German based on waves 1 and 2 is Opp 1997. There is no other panel data set in which the same individuals were interviewed under communist rule and several times after unification.


When we first realized the discrepancy between the number of protesters in our survey and the 70,000 number, we thought that something was wrong with the survey. Thus, at this point we too took the 70,000 number as given and asked ourselves, "what could have gone wrong with the data collection?" There were several possibilities, as outlined below.

Social desirability and response bias. At the time the survey was administered, it seemed that having participated in the Monday demonstrations conferred some status on respondents. Thus, a respondent who had to admit to not having participated in the decisive Monday demonstration on October 9 might feel embarrassed and therefore deliberately give a wrong answer to the interviewer. An explanation for the high number of participants in the survey could thus have been that the respondents overstated their participation: a large number of non-participants could have said they had participated because this was a socially desirable answer.

We anticipated this reaction and took some precautions. The questions about protest participation (including participation in the Monday demonstrations) were asked in a separate self-administered questionnaire: at some point during the interview, respondents were asked to fill out a questionnaire that was handed to them together with an envelope. The respondent then put the completed questionnaire back in the envelope and closed it, meaning the interviewer could not see the answers. Furthermore, the respondent was assured that the interviewer would not open the envelope, so there could be no negative reactions on the part of the interviewer to undesirable answers.

Another indicator of the absence of social desirability was the answer to a question about membership in the SED (the communist party in the GDR). Admitting SED membership was not regarded as desirable when our interviews took place. If respondents in general gave socially desirable answers, the number of respondents who admitted SED membership should have been lower than the real number. We got the actual number of SED members before October 9, 1989 from the local PDS (the successor party of the former SED) party office. The actual numbers and the numbers extrapolated from the interviews did not differ. This is an important indicator of the absence of social desirability, because the question about SED membership was posed during the interview and not in the self-administered questionnaire.

There might be other response biases, but in this case they seem implausible. For example, people may not remember whether they participated, and for some reason non-participants reported participation. This is implausible because participation in the demonstration of October 9 was unique, highly consequential and emotionally resonant. According to psychological research, one would expect that such events are easily remembered. In general, it cannot be ruled out that there are biases in the data, but our detailed analyses suggest that this is implausible (see in particular the discussion in Opp/Gern 1993). We are therefore confident that the big gap between the official number of participants and the survey results cannot be explained by response bias.

Selection bias and over-representation. Another possibility is that the high number of demonstrators in the survey is due to the self-selection of the respondents: those who participated might have been more interested in being interviewed and therefore accepted being interviewed in greater numbers than those who did not. Although the procedure for selecting the respondents was random, self-selection could have led to the high number of participants because the response rate was about 40 % (see Opp et al. 1995: Appendix).
However, we tested to what extent the distributions of various demographic characteristics in the sample coincided with the distributions in the total population of Leipzig, based on official statistics. These comparisons suggest that the selection of the respondents was not biased.

An over-representation of protesters in the survey might also arise by chance. Assume there is a population of 1,000 with 100 protesters (10 %). Let 300 different probability samples be drawn, each with 50 respondents. It is possible that one of the samples consists only of protesters, but the likelihood of selecting such a sample is very low. Another possibility is that one of the samples consists of 15 protesters, i.e. 30 %. Even if we expect most of the 300 samples to consist of about 10 % protesters, such deviations are possible, though very unlikely (see the simulation sketch at the end of this section). However, our analyses that compared different demographic variables of the sample and the population of Leipzig indicate that these variables are well represented in the sample. This is an indicator that the number of demonstrators is not overrepresented either.

Conclusion. Our empirical and theoretical analyses of the data suggest that the large number of participants in our sample was not due to biased data or to sample selection. So, how can our higher estimate of participants be explained? We were unsure, and it was only then that we considered the possibility that the official number of 70,000 participants might be wrong. As was said before, this idea came late in our analysis of the data. This shows how trustworthy official data are taken to be, and how readily people accept them. This is also apparent when I discuss the project in seminars, lectures and talks. I usually report the gap between the survey results and the official number and then ask how the discrepancy could be explained. It is very rare that a student or someone in the audience asks whether the number of 70,000 might be wrong. Thus, official numbers seem to be so trustworthy that their accuracy is taken for granted.
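The thought experiment above - 300 samples of size 50 from a population of 1,000 containing 10 % protesters - can be mimicked with a short simulation. The sketch below is added for illustration and was not part of the original analysis:

```python
# How often does a random sample of 50 from a population of 1,000 with
# 100 protesters (10 %) over-represent protesters as strongly as 30 %?
import random

random.seed(1)
population = [1] * 100 + [0] * 900   # 1 = protester, 0 = non-protester
extreme = sum(
    sum(random.sample(population, 50)) >= 15   # 15 of 50 = 30 %
    for _ in range(300)
)
print(extreme, "of 300 samples reached 30 % protesters")  # practically never
```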

How to assess the size of a demonstration after it has happened

After taking into account this new idea that the official number might be wrong, we tried to find out how the 70,000 number was estimated. We first called various newspapers, as well as the German Press Agency (Deutsche Presseagentur). Nobody could tell us how the number of 70,000 came about and what the source of this number was. The Leipzig police - which often publish estimates - did not have any information either. Was there any other way of checking this number more than a year after the demonstration took place?⁵

Two pieces of information were required (Opp et al. 1995: 24). One is the size of the area in square meters where the demonstrators stood. The second is how many demonstrators stood together in a square meter. If these questions can be answered, then the size of the demonstration can easily be computed: one may simply multiply the square meters by the number of participants per square meter.

How did we get the information about the area where the demonstrators stood? We found numerous photos taken during the demonstration on October 9, 1989, showing Karl Marx Square (where the demonstration took place) and the adjacent streets (such as Grimmaische Straße). Participants also told us where demonstrators stood. Peter Voß (one of the three authors of Opp et al. 1995) and I paced the area where the demonstrators stood. The size of this area is 41,500 square meters.

⁵ In seminars and lectures with students where the case discussed in this paper was presented, I usually ask how one could check the number long after the demonstration took place. Very few students come up with the procedure that will be described shortly.


So, how many participants stood together in a square meter? The photos show that four persons per square meter is a good estimate. If we assume four persons per square meter, the number of participants was 4 × 41,500 = 166,000. If we assume only three persons per square meter, there may have been 3 × 41,500 = 124,500 participants. There can thus be no doubt that the number of 70,000 participants at the Leipzig Monday demonstration on October 9, 1989 is plainly wrong. Actually, the number of participants must have been between 124,500 and 166,000. It was certainly not as low as 70,000.

It should be said that our procedure is not as original as we first thought. Our inquiries indicate that the police always proceed in the same way when the size of gatherings is estimated. However, as mentioned above, this did not happen for the demonstration on October 9, 1989.
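The estimation procedure amounts to a single multiplication; as an added illustration, the figures reported in the text can be restated directly:

```python
# Crowd size = paced-off area times the density read off the photos.
area_m2 = 41_500                       # square meters occupied by the crowd
for persons_per_m2 in (3, 4):
    print(persons_per_m2, "per m2:", area_m2 * persons_per_m2, "participants")
# 3 per m2 gives 124,500 and 4 per m2 gives 166,000 - both far above 70,000.
```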

How the 70,000 estimate was established

Those who accept the 70,000 number might still be skeptical in regard to the previous arguments. The missing link between our survey results and the measurement device for the real number of demonstrators would be information about how this mystical 70,000 estimate was created and became common knowledge. A plausible hypothesis is that there was some trustworthy or high-status individual or group that first generated this number and then circulated it.

It was a lucky coincidence that Helena Flam, Professor of Sociology at the University of Leipzig, where I had been a professor from 1993 to 2002, undertook a survey of members of the opposition groups in Leipzig. On May 30, 1996, she sent me a transcript of an interview that addresses how the 70,000 estimate and those of other demonstrations originated. This interview has not yet been published, and so what follows is a summary of the text that is of interest in this context.

To understand the following account, one must know that the Monday demonstrations took place after the peace prayers in the Nikolai Church, which were held on Mondays from 5 pm to 6 pm. This church is close to Karl Marx Square. After the peace prayers, participants walked from the church to the square and through the large streets that surround the city center. Other participants joined in who had not taken part in the peace prayers.

Here is the story of how the 70,000 number came about. A member of the opposition groups with close contacts to the parish of the Nikolai Church reported in the interview that, on October 9, 1989, he and some others participated in the peace prayers. They then went to the parish office, where others were already present. From there they had a view out on the Nikolaikirchhof, the square next to the Nikolai Church. Everybody was surprised that so many people had gathered. The group then joined the demonstrators for some time, walked around the inner city, and went back to the office. Representatives of the media from Berlin who wanted to get information about the Monday demonstration, and especially about the number of demonstrators, called the office. Here is what one of the interviewees said (my translation, the conversational style is polished):

"We had to come up with a number. People called us, the media wanted a number. How many people came to the demonstration? Four of us then sat together. One said there are 50,000. Another one said 90,000. We then met somewhere in the middle and made public 70,000. That is the number that is now reported in the media."


Another interviewee said: "We do not know at all whether this was ever right." The same interviewee continues: "We did the same thing every Monday. We then took a bike and rode in the direction of the demonstration and always made guesses. We then called R. in Berlin⁶ and he always said: there must be more than last Monday. Then next Monday we said 90,000 or so. Whether this was correct, nobody knows."

These interviews show that the estimate of 70,000 was not based on a reliable procedure like the one that we applied. It was just intuition. Furthermore, it cannot be assumed that this intuition was based on some training. Let us assume that a member of the police who has provided estimates of many gatherings (based on photos) had been asked to assess the size of the Monday demonstrations. We would trust such estimates more than the estimates of opposition members who did not have such training. Indeed, when judging the correctness of the 70,000 estimate in this case, it is also important that the interviewees themselves indicate how questionable their estimates were. Moreover, it is also important to note that the estimates of the size of the other Monday demonstrations are not valid either. This is also in line with our data (for details see Opp et al. 1995: 257-258). However, since this is not the topic of this paper, we will not go into these estimates any further.

What can we learn from the case?

This paper describes a case in which a generally accepted historical "fact" seemed questionable and did indeed turn out to be wrong: it is generally accepted that 70,000 citizens participated in the Leipzig demonstration on October 9, 1989. This is about 13 % to 15 % of the population of Leipzig. Our representative survey suggested a participation rate of 26 % of the population, i.e. about 130,000. Based on photos, we paced off the area where the demonstrators convened; the photos indicate that there were between three and four persons per square meter. Accordingly, between 124,500 and 166,000 citizens actually participated. The creation of the 70,000 estimate is a case of arbitrary data construction that found its way into history. Without our survey and the lucky circumstance of coming across Helena Flam's interviews with members of the former opposition, the 70,000 estimate would never have been called into doubt.

What then are the lessons from this case for social scientists (including anthropologists, historians, social psychologists, sociologists and political scientists)?

(1) If a social scientist uses data, he or she should examine whether there have been efforts to test the validity of the data. Perhaps the most striking feature of the 70,000 number is that there does not seem to have been the slightest interest or effort to examine its validity. The postulate - check the validity of the data that is used - should hold for everybody who reports data to an audience. In the present case, such audiences are not only social scientists, but journalists, writers, organizations such as museums, and the general public.

⁶ The name is known to me; the person is a journalist who informed the media.


(2) Another lesson we can draw from our case is: don't even trust data (including numbers) that seem firmly established. Be particularly suspicious if the data have become common knowledge and not the slightest doubt about their validity exists.

(3) One should have a very critical look at the source of, and the kind of evidence provided for, data. Assume it had been documented that members of the opposition in 1989 had estimated the size of the Monday demonstrations; the source of the numbers would thus have been identified. This is in general the first question that should be asked: who collected or who publicized the data first? Is it a scholar, a governmental agency, a research team, or a private firm? The second question is: how were the data collected? Was the method participant or non-participant observation, or are the data based on formal or informal interviewing or on existing documents? When these questions are answered, social scientists should next apply hypotheses from the literature on methods in order to examine the validity of the data. For example, in judging historical data it is recommended to first study the historian who collected the data, before one begins to study the data (Carr 1961: Chapter I). In other words, the interests and perceptions of the historian should be examined, along with their possible influence on the kind of data he or she presents. This holds for any data source. For example, assume that a group of scholars evaluates a curriculum. Is there some likelihood that the data are biased (or even faked)? One should look at the commitment to, or involvement of, the research team in the curriculum. Perhaps the scholars were members of the group who set up the curriculum or openly emphasized its positive features. In this case, there will be a strong interest not to get negative results from the evaluation. Representatives of a government agency rarely have an interest in presenting unbiased data. Furthermore, those who are interviewed or observed may have an interest in presenting themselves in a positive way. All this is familiar to social scientists. But our case suggests that social science knowledge about the conditions for the validity of data is not always applied.

(4) Many social scientists don't trust surveys. If the results of surveys are inconsistent with official data, it is likely that the latter are regarded as credible. This happened in our case as well, as was described above. Our case study is not in line with this belief. The lesson is: if survey data and official data are at odds, each piece of data or data set should be scrutinized.

(5) As was mentioned before, evidence that the 70,000 number is wrong has been available since 1993 (Opp et al. 1993: 47). Why has this publication been ignored? To be sure, the discussion of the validity of the 70,000 number in the book is very short, not even a page, and the story about who brought up the number and how it was estimated had not been published. But scholars interested in the history of the GDR would be expected to look at a book that addresses historically important events in German history, especially because the book provided new data. Perhaps because the book is a quantitative sociological study, historians, journalists and members of organizations did not consult it.
Although the book was written in an accessible way - no statistical knowledge was presupposed to understand the theory and the findings - it still seemed too "scientific" compared to the numerous popular and scholarly books about the GDR that have appeared since 1989. The neglect of the refutation of the 70,000 number indicates that, even if there is published evidence that some historical "fact" is wrong, it may simply be ignored. A reason might be that the 70,000 number was so firmly believed that contradictory evidence seemed implausible.


Even if there is such a belief, however, a serious scholar should deal with such evidence. We do not know of any discussion of our argument. The lesson could be: if there is conflicting evidence available refuting a historical "fact" that seems evident, resources should be devoted to finding out the truth.

(6) Perhaps another lesson is that social scientists who find new evidence refuting generally accepted data should communicate their findings to the media and to organizations that use the numbers (such as directors of museums concerned with German history). This may not succeed, because changing beliefs that have been around for decades is costly, but it is worth a try. Another problem might be that scientists are often not interested in what happens outside their field: communicating research results to the media is also costly and, moreover, does not yield status in the scientific community. Nonetheless, perhaps social scientists should write a few pages about new, relevant findings and submit them to newspapers, websites, and individuals who report or make use of the data. (By the way, we did not do this either!)

We have not dealt with faked data in this paper. Would it have been necessary to write a completely different paper if the 70,000 number had been faked? Assume those who publicized this number had known the correct number of participants but intentionally reported 70,000. The first part of the paper, before the section "How the 70,000 estimate was established", would have been identical, but that central section would have been very different. The revelation would not have been that there was negligent or, less negatively, "cavalier" (Carr 1961: Chapter I) handling of data, but intentional misrepresentation. What about the section "What can we learn from the case?"? If you, the reader, go over this section and imagine that the 70,000 number was intentionally faked, you will nonetheless see that the lessons would have been the same! The conclusion is therefore that the precautions to be taken to detect biased data are the same, however this bias originated: by negligence, unconscious biased perception or conscious fraud.

References

Bartee, W.C. (2000), A Time to Speak Out: The Leipzig Citizen Protests and the Fall of East Germany. Westport, Conn., and London: Praeger.
Carr, E.H. (1961), What is History? London: Macmillan.
Lohmann, S. (1993), A Signaling Model of Informative and Manipulative Political Action. American Political Science Review 87: 319-333.
Lohmann, S. (1994), Dynamics of Informational Cascades: The Monday Demonstrations in Leipzig, East Germany, 1989-91. World Politics 47: 42-101.
Opp, K.-D. (1993), Spontaneous Revolutions. The Case of East Germany in 1989. Pp. 11-30 in: H.D. Kurz (ed.), United Germany and the New Europe. Cheltenham: Elgar.
Opp, K.-D. (1994), Repression and Revolutionary Action. East Germany in 1989. Rationality and Society 6: 101-138.
Opp, K.-D. (1997), Die enttäuschten Revolutionäre. Politisches Engagement vor und nach der Wende. Opladen: Leske + Budrich.
Opp, K.-D., H. Brandstätter (2010), Political Protest and Personality Traits: A Neglected Link. Mobilization 15: 323-346.
Opp, K.-D., Ch. Gern (1993), Dissident Groups, Personal Networks, and Spontaneous Cooperation: The East German Revolution of 1989. American Sociological Review 58: 659-680.
Opp, K.-D., B. Kittel (2010), The Dynamics of Political Protest: Feedback Effects and Interdependence in the Explanation of Protest Participation. European Sociological Review 26: 97-110.


Opp, K.-D., P. Voß, Ch. Gern (1993), Die volkseigene Revolution. Stuttgart: Klett-Cotta.
Opp, K.-D., P. Voss, Ch. Gern (1995), The Origins of a Spontaneous Revolution. East Germany 1989. Ann Arbor: Michigan University Press.
Pfaff, St. (2006), Exit-Voice Dynamics and the Collapse of East Germany: The Crisis of Leninism and the Revolution of 1989. Durham, NC: Duke University Press.

Prof. Dr. Karl-Dieter Opp, Universität Leipzig (Emeritus), University of Washington, Seattle (Affiliate Professor). Private address: Sulkyweg 22, 22159 Hamburg, Germany. [email protected]


"True Believers" or Numerical Terrorism at the Nuclear Power Plant By Walter Krämer, Dortmund, and Gerhard Arminger, Wuppertal* JEL C10; C12; C52 Significance; data mining; overrejection.

Summary

For decades, there has been a heated debate about whether or not nuclear power plants contribute to childhood cancer in their respective neighbourhoods, with statisticians testifying on both sides. The present paper points to some flaws in the pro-arguments, taking a recent study prepared for the political party "Bündnis 90/Grüne" as a specimen. Typical mistakes include an understatement of the size of tests of significance, disregard of important covariates and extreme reliance on very few selected data points.

1 Introduction and summary

In the fall of 2009, the German political party Bündnis 90/Grüne (2009) produced a temporary stir in the German media by claiming final proof that nuclear power plants induce childhood leukemia. "AKW erhöhen das Leukämierisiko" (nuclear power plants increase the risk of leukemia) was the heading of a press release. While not even the meta-analysis by Greiser (2009), which formed the basis of this press release, makes any such claim (since Greiser is well aware of the difference between correlation and causation), the press release strongly contributed to the fiercely held belief of many Germans that nuclear power is bad for you.

The present paper shows that presumably not even the correlation claimed by Greiser (2009) exists. We use his study to exemplify various mistakes that are often made when statistical analyses are guided by strong a priori beliefs, as is so typical of the leukemia vs. nuclear power debate. The first and most prominent source of error is an understatement of the true size of tests of significance, which results from the well-known publication bias. We provide a brief survey of this literature and show that there is ample reason to believe that this bias also prevails in the leukemia debate. In technical terms, the true significance level of such tests is much larger than the nominal one reported in the respective papers. Other mistakes include the disregard of important covariables and the heavy reliance on outliers which, when removed, reverse the patterns observed before. Then there is the well-known phenomenon called HARKing ("Hypothesizing After the Results are Known"), where tests of significance take place only after some abnormal data have been observed.

"True Believers" or Numerical Terrorism at the Nuclear Power Plant · 609

Known"), where tests of significance are taking place only after some abnormal data have been observed. This seems to apply in particular to the leukemia debate, where many studies were undertaken only after the media had aroused attention to abnormal incidence or mortality close to nuclear installations of various types. Taken together, these deficiencies seem to invalidate any "proof" that nuclear power correlates with childhood leukemia, let alone that is responsible for it. While it might still be true that some such relationship exists, it certainly cannot be derived from the evidence that is available so far. Even if one does not subscribe to the well known Taubes (1995) - thesis that epidemiological evidence of any sort should only be taken seriously if there is at least a twofold increase in the risk observed, one needs much more and in particular much more convincing data before sounding the kind of alarm that is so popular among true believers in science and in the media alike. 1 2

2 Empirical studies of cancer incidence around nuclear power plants

There is an enormous literature in statistics, epidemiology and public health on childhood cancer, in particular childhood leukemia, in the vicinity of nuclear installations of all sorts. It dates back to a 1982 British television documentary entitled "Windscale: the Nuclear Laundry", which reported an abnormal incidence of leukemia in young people living in the village of Seascale close to the nuclear site of Sellafield, and it has subsequently spawned an enormous interest in similar clusters elsewhere. Among the studies which did find such clusters, or at least "abnormal" rates of incidence or mortality, are Heasman et al. (1987), Ewings et al. (1989), Clarke et al. (1991), Körblein and Hoffmann (1999) or Hoffmann et al. (1996, 2007), just to name a few. Alexander (1999) and Laurier and Bard (1998) provide convenient summaries of the earlier literature, and Baker and Hood (2007) and later Greiser (2009) collect many of these studies for meta-analyses which led to similar results.

Also - partially - included in these meta-analyses were studies which could not find any excess incidence or excess mortality. Because they are so rarely cited, we present their main conclusions here:

"No excess cases were found in small towns around the plant" (Sofer et al. 1991: 191).

"Our study gives no evidence for an increased risk of childhood leukaemia ... in the vicinity of nuclear installations" (Michaelis et al. 1992: 262).

"No increase of Leukaemia and lymphoma mortality in the vicinity of nuclear power stations in Japan" (Iwasaki et al. 1995).

"We see no statistically significant clustering of the observed cases about the four nuclear power plants in Sweden" (Waller et al. 1995: 14).

"There was no evidence of a generally increased risk of childhood leukaemia ... around nuclear sites in Scotland" (Sharp et al. 1996: 823).

"Over the entire zone, children do not have an increased risk of malignant haematology disease" (Bouges et al. 1999: 205).

"With epidemiology you can tell a little thing from a big thing. What's very hard to do is to tell a little thing from nothing at all." This is a quotation attributed by Taubes (1995:164) to the director of analytical epidemiology of the American Cancer Society.


"Our study shows no evidence of a generally increased risk of childhood leukaemia within 20km of the 29 nuclear sites under study" (White Koning et al. 2004). "There is no indication of any effect on the incidence of childhood cancer" (COMARE 2006: 115). "It is concluded that there is no evidence that acute leukaemia in children aged under five has a higher incidence close to NPSs in Britain" (Bithell et al. 2008: 196). "Neither for the whole study region nor for the individual NPP areas was a statistically significant average observed" (Kaatsch et al. 2008b: 727). "Our results do not indicate an increase in childhood leukemia and other cancers in the vicinity of Finnish NPPs" (Heinävara et al. 2009). In the next section we argue that such studies, i. e. studies which report no effect at all, or no "significant" effect, have much lower chances of being undertaken in the first place and later getting published in the second. Or how often does one stumble on a journal article like "Pet ownership and childhood acute leukemia" (Swensen et al. 2001), which, after protracted investigations, finds that "no relationship was found between exposure to an ill pet and childhood leukemia" (p. 301)? This certainly does not happen very often, with the net result that meta-analyses such as Greiser (2009) are much more likely to summarize positive than negative results and are therefore much more likely than the nominal α-error claims to find effects where none exist.

3 Publication bias and errors of the third kind

A significance level of 5 % for a statistical test means that, even when no effect is present, the test will claim one in roughly 5 out of 100 trials. This is the well-known error of the first kind, which among the uninitiated often leads to an error of the third kind: to assume that a significant test implies that the alternative is true. "The sin comes in believing a causal hypothesis is true because your study came up with a positive result" (Sander Greenland from UCLA, as quoted in Taubes 1995: 169).

This error of the third kind, or some variant such as "the null hypothesis is wrong with 95 % probability", occurs even among professional statisticians. Haller and Krauss (2002) asked 30 statistics instructors, 44 statistics students and 39 practicing researchers from six psychology departments in Germany about the meaning of a significant two-sample t-test (significance level = 1 %). The test was supposed to detect a possible treatment effect based on a control group and a treatment group. The subjects were asked to comment on the following six statements (all of which are false). They were told in advance that several or perhaps none of the statements were correct.

1) You have absolutely disproved the null hypothesis (that is, that there is no difference between the population means). O true / false O
2) You have found the probability of the null hypothesis being true. O true / false O
3) You have absolutely proved your experimental hypothesis (that there is a difference between the population means). O true / false O
4) You can deduce the probability of the experimental hypothesis being true. O true / false O

"True Believers" or Numerical Terrorism at the Nuclear Power Plant · 611

5) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. O true / false O
6) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99 % of occasions. O true / false O

All of the statistics students, 90 % of the practicing psychologists and 80 % of the methodology instructors marked at least one of the above faulty statements as correct. What is more, many statistics textbooks make the same mistakes. Examples from the American market include Guilford (1942, and later editions), which was probably the most widely read textbook in the 1940s and 50s, Miller & Buckhout (1973, statistical appendix by Brown, p. 523) or Nunnally (1975: 194 ff.). On the German market, there is Wyss (1991: 547) or Schuchard-Fischer et al. (1982), who on p. 83 of their best-selling textbook explicitly advise their readers that a rejection of the null at 5 % implies a probability of 95 % that the alternative is correct. For details, see Gigerenzer (2002, chap. 13), Krämer and Gigerenzer (2005), or Krämer (2008, chapter 8).

Another mistake, unrelated to but often occurring in tandem with the one above, is to report some nominal significance level α when in reality the reported test statistic is the most significant one among n trials, each conducted at the level α. The true significance level is then simply the probability that the maximum of n test statistics is larger than some critical value, and it increases rapidly with n. Table 1 gives some examples for independent trials and various nominal significance levels of the test.

Table 1: True significance level when rejection is based on the most unfavourable of n independent trials

                     Nominal significance level
number of trials     1 %       5 %       10 %
 2                   1.9 %     9.8 %     19.0 %
 3                   3.0 %     14.3 %    27.1 %
 4                   3.9 %     18.5 %    34.4 %
 5                   4.9 %     22.6 %    41.0 %
10                   9.6 %     40.1 %    65.1 %
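The entries in Table 1 follow from elementary probability: with n independent trials at nominal level α, the chance of at least one spurious rejection is 1 - (1 - α)^n. As an added illustration, a few lines suffice to reproduce the table:

```python
# True size of a test when the best of n independent trials is reported:
# P(at least one rejection) = 1 - (1 - alpha)^n.
for n in (2, 3, 4, 5, 10):
    row = [f"{1 - (1 - alpha) ** n:.1%}" for alpha in (0.01, 0.05, 0.10)]
    print(f"{n:2d} trials:", *row)
# e.g. 10 trials at a nominal 5 % level yield a true level of 40.1 %.
```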

Krämer and Runde (1992) have used this trick to establish what they call the "Krämer-Runde seven-modulo-1 effect". In words, this means that on days of the month no. 1, 8, 15, 22, and 29, the German stock price index DAX performs significantly better than average (t = 3.161). Or in technical terms, the null hypothesis that stocks perform the same on these days as on others could be rejected, given the available data, at a level of 5 %. What Krämer and Runde also did, and also reported, were additional tests of many other hypotheses: there is no six-modulo-2 effect, there is no six-modulo-3 effect, there is no seven-modulo-2 effect, no eight-modulo-3 effect, and so on, ad nauseam. Given a particular data set and one hundred such hypotheses, all of them true, one is still bound to find about five "significant" effects, i.e. rejections of the null. And it is well known (see e.g. McCloskey 1983 or Ziliak and McCloskey 2008) that many other authors proceed along similar lines without reporting the unsuccessful trials; see also Krämer (2010, chapter 15). And although an increasing number of authors seem to be aware of this (see e.g. Fertig and Tamm 2010), only few take recourse to the impressive toolbox of multiple testing procedures which have been developed to control for this effect. A small simulation below illustrates the mechanism.
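The flavour of the Krämer-Runde exercise can be reproduced by simulation: generate purely random "returns", test a battery of arbitrary day-of-month hypotheses, and count the spurious rejections. The sketch below assumes i.i.d. normal returns and invented groupings; it does not use their DAX data:

```python
# Data mining on noise: test many arbitrary "m-modulo-k" calendar effects
# on random returns and count rejections at the 5 % level.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
days = rng.integers(1, 32, size=5000)       # day of month, 1..31
returns = rng.normal(size=5000)             # artificial returns, no true effect

rejections, tests = 0, 0
for m in range(2, 12):
    for k in range(m):
        on = returns[days % m == k]
        off = returns[days % m != k]
        rejections += ttest_ind(on, off).pvalue < 0.05
        tests += 1
print(rejections, "of", tests, "tests 'significant' despite pure noise")
```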

612 · Walter Krämer and Gerhard Arminger

sive toolbox of multiple testing procedures which have been developed to control for this effect. In economics, this habit of reporting only the most unfavorable (to the null hypothesis) results is sometimes referred to as "data mining" (Lovell 1983) 2 . It is of course strictly illegal and rightly frowned upon. Not illegal, but equally misleading, is the related phenomenon known as "publication bias": 100 authors, each testing at 5 % , are searching for effects, but there are none. Five studies still observe significant results. All studies are submitted for publication. Which have higher chances for acceptance? One does not have to think hard (see section 2). Let us assume that 4 of the 5 studies with positive results and 36 of the 95 studies with negative results find their way into some scientific journal. This means that the true significance level of the tests is not 5 % but 1 0 % , and this happens even when no individual investigator engages in data mining. Denton (1985) calls this "collective data mining" and provides a rule of thumb to adjust for it in some selected applications. It is common knowledge that such "collective data mining" is happening in almost every field where formal tests of significance are employed. "There is some evidence that in fields where statistical tests of significance are commonly used, research which yields nonsignificant results is not published" (Sterling 1959: 30). "Such research being unknown to other investigators may be repeated independently until eventually by chance a significant result occurs." Taken to the limit, this argument implies that a "significant" effect will be found eventually almost surely, no matter what. In psychology, this bias is also known as the file drawer problem: negative results remain stuck in the file drawer. In medicine, Stern and Simes (1997) report that among 748 studies approved by the Royal Prince Alfred Hospital Ethics committee between 1979 and 1988, about 85 % were eventually published if they reported significant results at levels 5 % or less. Among studies which did not report significant results, this percentage of published papers was only 50 %. See also Beck-Bernholdt and Dubben (2004). In economics, it is above all McCloskey who has repeatedly, although with little effect, drawn attention to this phenomenon, and the implications that this form of statistical nonsense has for the field as such: "The progress of economic science has been seriously damaged. You can't believe anything that comes out of [it]. Not a word. It is all nonsense, which future generations of economists are going to have to do all over again. Most of what appears in the best journals of economics is unscientific rubbish. I find this unspeakably sad. All my friends, my dear, dear friends in economics, have been wasting their time....They are vigorous, difficult, demanding activities, like hard chess problems. But they are worthless as science" (2002: 44). This is rather harsh judgement, and a bit beside the point. For instance, the large area of specification testing, where there is no particular alternative, and therefore no "effect" to be established, has certainly improved empirical economic work a lot. But whenever significance tests are meant, not to test the validity of some model (which in case of rejection is to be substituted by a better one), but to establish a particular and prearranged alternative, pitfalls abound.
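The publication-bias arithmetic above (4 of 5 positive and 36 of 95 negative studies published, hence 4/40 = 10 % false positives among published studies) can be checked with a small simulation. This is our illustration only; the acceptance probabilities are the ones assumed in the text:

```python
# "Collective data mining": many studies test true null hypotheses at the
# 5 % level, but significant results are more likely to be published.
import random

random.seed(1)

def published_false_positive_share(n_studies=100, alpha=0.05,
                                   p_pub_sig=4/5, p_pub_nonsig=36/95):
    sig, nonsig = 0, 0
    for _ in range(n_studies):
        significant = random.random() < alpha       # every null is true
        if significant:
            sig += random.random() < p_pub_sig      # journal accepts
        else:
            nonsig += random.random() < p_pub_nonsig
    total = sig + nonsig
    return sig / total if total else 0.0

# Averaged over many replications, about 10 % of all published studies
# report a "significant" effect although no effect exists anywhere.
shares = [published_false_positive_share() for _ in range(10_000)]
print(round(sum(shares) / len(shares), 3))
```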

2 Not to be confused with the serious business of the same name that is a modern subject of computer science.

"True Believers" or Numerical Terrorism at the Nuclear Power Plant · 613

4 Data mining in radiation epidemiology

At the time of this writing, there are 439 commercial nuclear power reactors operating worldwide. Some sites have more than one reactor (in Germany, Biblis is an example), so the number of different sites is only 210. In addition, there are 368 operational research reactors, 10 reprocessing plants, 14 uranium refineries, and several dozen uranium mining and milling facilities and atomic weapon factories each (the exact number of the latter being, for obvious reasons, hard to validate). Adding the well above 300 nuclear sites which had been in operation sometime but have by now been decommissioned or shut down, there are well above 1000 geographical locations worldwide available for testing.3

Greiser (2009) singles out 80 of these.4 The respective data are mostly from previous studies, which, like the Seascale studies in the UK, have in turn often been undertaken subsequent to the occurrence of leukemia clusters. This HARKing (Hypothesizing After the Results are Known) reinforces the data mining effect. In Germany, for instance, testing on a massive scale started only after an abnormal cluster of leukemia cases was observed close to the Krümmel power generation plant.

Another important degree of freedom is the time period under consideration. The literature abounds with examples where excess mortality or morbidity was found in certain periods, but not in others (Heasman 1987; Möhner/Stabenow 1993; Kaatsch et al. 2008). For instance, the studies from Canada quoted by Greiser (2009), reporting excess incidence of childhood leukemia around Canadian nuclear power plants, cover only years up to 1986. It is rather safe to assume (and confirmed by private information from Canadian authorities) that no excess incidence was observed thereafter.

Then one has to choose a distance from the potential source of radiation. Conventional choices are 6.5 km (Evrard et al. 2006)5, 15 km (Kaletsch et al. 1997; Möhner et al. 1993), 20 km (Laurier et al. 2008), 25 km or 50 km (COMARE 2005, 2006) or complete counties, as in most studies from Canada and the U.S. Again, there is an abundance of examples where excess incidence or mortality was observed for some distances, but not for others. It is also not true that incidence necessarily increases with proximity to power plants. Laurier et al. (2006, Table 1), for instance, report 5.2 expected and 5 observed cases within a 5 km distance from 19 French nuclear power plants, as compared to 69.3 expected and 71 observed cases when the distance is increased to 20 km. Similar results are also given in Bithell et al. (2008: 195), who find "that there is no association between childhood cancer and proximity to NPs in the UK."

Then there is the type of cancer (myeloid leukaemia - ML, acute lymphoblastic leukaemia - ALL, acute non-lymphoblastic leukemia, Non-Hodgkin lymphoma, other cancers), which likewise might lead to an excess for one type and a deficit for another. Kaatsch et al. (2008: 530), for instance, find an excess of leukemia, but a deficit of other childhood cancers close to nuclear power plants in Germany. And sometimes there is an excess of ML but not of ALL, or vice versa, so any investigator has a large number of choices where to investigate. In addition, the age group of the children is also important.

3 The numbers are from Wikipedia and the websites of Atomforum (http://www.kernenergie.de/kernenergie/Themen/Kernkraftwerke/Kernkraftwerke_weltweit/index.php) and the International Atomic Energy Agency (http://nucleus.iaea.org/RRDB/RR/ReactorSearch.aspx?rf=l eew).
4 In fact, the number of sites on which his tables are based is even smaller than he claims: 69 rather than 75 in his Table 4, for instance.
5 Not 40 km, as claimed by Greiser (2009). An area of 40 square km and an area of 40 km × 40 km are not the same.


Laurier et al. (2006: 402) and Evrard et al. (2006, Table 2), among many others, report an excess of leukemia for some age groups, and a deficit for others.

It is obvious that by judiciously adjusting these parameters it is trivial to establish "significant" effects of any sort. A prime example is Körblein and Hoffmann (1999: 18), who, being dissatisfied with negative results from another epidemiological study, got what they wanted using the same data set: "A reanalysis of the data ... reveals a statistically significant increase in childhood cancers ... when the evaluation is restricted to commercial power reactors, the vicinities closest to the plants and children of the youngest age group."

Greiser (2009) uses all data available to him from previous studies, plus data from various U.S. cancer registries. The following table, compiled from his Table 4, p. 20-21, gives the number of leukemia cases for the age group 0-4. As this is also the age group where radiation-induced susceptibility to leukaemia is supposed to be highest, we focus on this data set in what follows.

Table 2 Observed vs. expected leukemia cases for age group 0-4, version I

Country    Number of sites    Expected cases    Observed cases
Canada      2                   47.7               58
France     19                  108                114
Germany    15                  524.8              593
U.K.        9                   43.8               50
U.S.       24                 1244.4             1312
Total      69                 1968.7             2127

The data for the UK cover only myeloid leukemia, which comprises about 20 % of all leukemia cases, so the corresponding numbers are rather small. The data for Germany, from Kaatsch et al. (2008a), who report an excess incidence of 13 %, are not explicitly given by Greiser (2009), and are taken from the initial study. Also, the number of sites (75) which Greiser quotes is not correct. Still, according to Table 2, the expected number of leukemia cases, if incidence around nuclear power plants were equal to the national average, is 1969, as compared to an actual number of 2127, so there certainly appears to be some reason for concern.

The particular statistical procedure which was employed by Greiser to show that this excess is "significant" shall not concern us here. Rather, the point we want to make is that any "significance", no matter how it was obtained, is bound to disappear once some obvious deficiencies have been accounted for. For instance, what if some plants outside the scope of Greiser (2009) had also been included? According to Sofer et al. (1991), Waller et al. (1995) or Heinävaara et al. (2010), cancer incidence around nuclear power plants in Sweden, Israel and Finland is no higher than elsewhere and sometimes well below. Also, no excess incidence has so far been reported for nuclear sites in Japan, Spain and Switzerland. Given the enormous media interest in occurrences of this kind, one can certainly be sure that any leukemia cluster close to a nuclear facility in these countries would have made headlines there as well.6 Therefore, the absence of such headlines provides evidence that no such clusters have occurred.

6 In fact, there was a preliminary examination in Switzerland following the KiKK excitement, which produced no effect and was therefore neglected by the media, see Reichmuth (2010). The final results will be available in 2011.

"True Believers" or Numerical Terrorism at the Nuclear Power Plant · 615

5 Disregard of confounding factors

As mentioned before, childhood leukemia often comes in clusters. Contrary to what most true believers claim, there is no consensus on the underlying causes. Extremely high doses of radiation might theoretically be responsible, but have never been observed or even been approximated in routine practice close to nuclear power plants. In fact, if there is any agreement at all among partisans in this debate, then it concerns the impossibility of routine doses of industrial radiation causing cancer in the first place: "Based on the findings of radiation research such a connection seems implausible, because the radiation emitted by an NPP in normal operation is at least 1000 times lower than 'background radiation', i.e. the 1.5 mSv of natural radiation to which the average German is exposed in a year" (Kaatsch et al. 2008b: 729).

According to Ries et al. (1999, Figure 6 and Table 1.5), and confirmed by many others, risk factors which are really important in practice are race and sex. For instance, childhood cancer incidence in the U.S. is 30 % higher for boys as compared to girls and almost double for whites as compared to blacks. For leukemia only, the highest incidence rates are observed among Hispanics (48.5 per million as compared to 41.6 per million for whites and 25.8 per million for blacks). By far the lowest rates for any type of childhood cancer are observed for American Indians. Also, leukemia incidence correlates strongly with income: the higher the income of the parents, the larger the risk of leukemia for kids (Borugian et al. 2005; COMARE 2006 and many others). The true underlying cause is still subject to debate; current hypotheses include an increased susceptibility of wealthy children to non-specific infectious agents (COMARE 2006: 12; wealthy children are brought up in "cleaner" environments and develop fewer antibodies) or a higher incidence of parental consanguinity. In Scotland, for instance, the incidence of childhood leukemia between the richest and the poorest subpopulations differs by as much as 50 %.

Other risk factors which have been identified so far are population density (more cases per 1000 children in densely populated as compared to sparsely populated areas: "it can be seen that the incidence of ... tumours increases as population density increases at both county district and ward level", COMARE 2006: 26) and population mixing (Kinlen 1995; Kinlen/Doll 2004; COMARE 2005: 8). Like population density, this might likewise lead to an increased exposure of susceptible individuals to infections and local epidemics which in turn could later promote the onset of cancers of many types.

It would be surprising if these established covariates did not also affect the numbers in Table 2. For instance, the plant that contributes most to the surplus of 158 leukemia cases reported in the table is the San Onofre Nuclear Generating Station in Southern California. It is located in the northwestern corner of San Diego County, south of the city of San Clemente, and started operations in 1967. Its initial unit is no longer in service, but two additional units, built in the early eighties, have licences to operate until 2022. According to Greiser (2009: 21, Table 4) there were 281 cases of childhood leukemia close to San Onofre (which in this case means: in San Diego County) in the 2001-2006 time period, compared to only 177 expected cases, an excess of 104. Therefore, this single data point contributes almost all of the excess cases in Table 2.
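The "expected cases" in Tables 2 and 4 are, in essence, indirectly standardized counts: population at risk times a reference rate. The following sketch (our illustration, not Greiser's computation; the group-specific rates are the per-million figures quoted above, the child counts anticipate Table 3 below, and the crude national rate of 41 per million is an assumed round number) shows how a single covariate such as ethnic composition shifts the expected count:

```python
# Indirect standardization with and without one confounder. All inputs are
# illustrative; this is not the actual computation behind Table 2 or Table 4.

crude_national_rate = 41e-6      # leukemia cases per child-year (assumed)
group_rate = {                   # per-million rates quoted in the text
    "hispanic": 48.5e-6, "white": 41.6e-6, "black": 25.8e-6,
}
children = {                     # San Diego County children < 5 (Table 3 below)
    "hispanic": 80_261, "white": 110_739, "black": 13_276,
}

crude = crude_national_rate * sum(children.values())
adjusted = sum(children[g] * group_rate[g] for g in children)

print(f"expected cases per year, crude:    {crude:.2f}")
print(f"expected cases per year, adjusted: {adjusted:.2f}")
# With many Hispanic and few black children, the adjusted expectation is
# higher, so part of an apparent "excess" over the crude figure vanishes.
```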
Now, looking closer at the San Onofre site (see Figure 1), it appears that the power plant is almost 300 km away from the south-eastern border of San Diego County, where it is supposed to be responsible for cancer. Attributing cancer cases there to radiation in San Onofre is like attributing cancer in Hanover to the Krümmel nuclear power plant on the river Elbe one hundred miles to the north. This is mistake no. 1: The geographical area for investigating leukemia cases connected to San Onofre is much too large. Even if one focused on more densely populated areas, this argument would still apply, since the metropolitan area of San Diego, where the bulk of the population of San Diego County lives, is still more than 100 kilometers away. This means that even if there were an impact of San Onofre on childhood leukemia, it could hardly be detected with the Greiser (2009) data set.

Even more important is mistake no. 2: The neglect of virtually all confounding factors which have so far been established in the literature. For instance, San Diego County is rather wealthy. According to Forbes Magazine, San Diego is the 4th wealthiest city in the U.S., and household income in San Diego County overall is 20 % above the national average, see Table 3. In addition, San Diego County has an above-average population of Hispanics and very few blacks (in the city of San Clemente, which is closest to San Onofre, blacks compose less than 1 % of the population). In fact, among children under the age of 18, the largest proportion in the meantime is Hispanic (which is also the ethnic group where leukemia incidence among children is highest). Also, both population density and population mixing are more pronounced in San Diego County than elsewhere in the U.S. San Diego is the largest concentration of naval facilities in the world, with a constant moving in and out of families, which is even further accentuated by a large university and many more military facilities such as training camps, airbases, Marine Corps recruit depots and coast guard stations. All of these variables correlate strongly with childhood leukemia.

Summing up, among factors which are known to correlate positively with childhood leukemia, almost every one is larger in San Diego County than elsewhere in the United States. Not surprisingly, therefore, taking account of these covariables and using data from the early days of operation of the plant, Enstrom (1983) found that childhood leukemia is no more prevalent around San Onofre than elsewhere.

Table 3 San Diego County vs. National Average (Census 2002)

Variable                            San Diego County    National Average
mean household income               $ 47,067            $ 41,994
percentage blacks                   5.7 %               12.3 %
percentage white                    66.5 %              75.1 %
percentage Hispanic or Latino       26.7 %              12.5 %
number of white children < 5        110,739
number of black children < 5        13,276
number of Hispanic children < 5     80,261

However, removing San Onofre from the Greiser (2009) data set, and adding some studies he has overlooked (for instance Bithell et al. 2008 and Kaatsch et al. 2008b), the initial surplus of leukemia cases among children aged 0-4 turns into a deficit (Table 4). Unlike in Table 2, the data for the UK now comprise all sorts of acute leukemia as specified by International Classification of Childhood Cancer Groups 11 and 12; therefore, incidence is larger. The data from Germany were collected by almost the same research group which had supplied the German data for Table 2 (Kaatsch et al. 2008a), but cover a longer time span. Therefore, the database for Table 4 is both more comprehensive and less prone to omitted variable bias (due to the deletion of San Onofre) than Table 2.

Table 4 Observed vs. expected leukemia cases for age group 0-4, version II

Country    Number of sites    Expected cases    Observed cases
Canada      2                   47.7               58
France     19                  108                114
Germany    15                  623.7              619
U.K.       13                  374.9              360
U.S.       23                 1067.9             1031
Total      72                 2222.2             2182
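As a rough plausibility check of the two tables, one can treat the observed totals as Poisson counts with means equal to the expected totals. This is our simplification, not the procedure actually used by Greiser (2009):

```python
# Normal approximation to a Poisson test: z = (observed - expected) / sqrt(expected).
from math import sqrt

def poisson_z(observed: float, expected: float) -> float:
    return (observed - expected) / sqrt(expected)

print(f"Table 2 (version I):  z = {poisson_z(2127, 1968.7):+.2f}")   # about +3.6
print(f"Table 4 (version II): z = {poisson_z(2182, 2222.2):+.2f}")   # about -0.9
# The strongly "significant" excess of version I becomes a small,
# insignificant deficit once San Onofre is removed and the overlooked
# studies are added.
```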

Will there ever be a study claiming that nuclear power protects against leukemia? With some proper data mining, and a convenient choice of statistical model, this salutary side effect can almost certainly be made highly "significant".

References

Alexander, F.E. (1999), Clusters and clustering of childhood cancer: A review. European Journal of Epidemiology 15: 847-852.
Baker, P.J., D.G. Hoel (2007), Meta-analysis of standardised incidence and mortality rates of childhood leukaemia in proximity to nuclear facilities. European Journal of Cancer Care 16: 355-363.
Beck-Bornholdt, H.-P., H.-H. Dubben (2004), Unausgewogene Berichterstattung in der medizinischen Wissenschaft - publication bias. Hamburg (Institut für Allgemeinmedizin des Universitätsklinikums Hamburg-Eppendorf).


Bithell, J.F., T.J. Keegan, M.E. Kroll, M.F. Murphy, T.J. Vincent (2008), Childhood leukaemia near British nuclear installations: Methodological issues and recent results. Radiation Protection Dosimetry 132: 191-197.
Borugian, M.J., J. Spinelli, G. Mezei, R. Wilkins, Z. Abanto, M.L. McBride (2005), Childhood leukaemia and socioeconomic status in Canada. Epidemiology 16: 526-531.
Bündnis 90/Grüne (2009), AKWs erhöhen das Leukämierisiko. Press release 07.09.2009.
Committee on medical aspects of radiation in the environment (COMARE) (2005), The incidence of childhood cancer around nuclear installations in Great Britain. 10th report, London.
Committee on medical aspects of radiation in the environment (COMARE) (2006), The incidence of childhood cancer around nuclear installations in Great Britain. 11th report, London.
Clarke, E.A., J. McLaughlin, T.W. Anderson (1991), Childhood leukaemia around Canadian nuclear facilities. Phase II. Final report. Atomic Energy Control Board, Ottawa.
Denton, F.T. (1985), Data mining as an industry. The Review of Economics and Statistics 67: 124-127.
Dewdney, A.K. (1996), 200 % of Nothing: An Eye Opening Tour through the Twists and Turns of Math Abuse and Innumeracy. New York (Wiley).
Enstrom, J.E. (1983), Cancer mortality patterns around the San Onofre nuclear power plant, 1960-1978. American Journal of Public Health 73: 83-92.
Evrard, A.S., D. Hémon, A. Morin, D. Laurier, M. Tirmarche, J.C. Backe, M. Chartier, J. Clavel (2006), Childhood leukaemia around French nuclear installations using geographic zoning based on gaseous discharge dose estimates. British Journal of Cancer 94: 1342-1347.
Ewings, P.D., C. Bowie, M.J. Phillips, S.A.N. Johnson (1989), Incidence of leukemia in young people in the vicinity of Hinkley Point nuclear power station, 1959-86. British Medical Journal 299: 289-293.
Fertig, M., M. Tamm (2010), Always poor or never poor and nothing in between? Duration of child poverty in Germany. German Economic Review 11: 150-168.
Gigerenzer, G. (2002), Calculated risks: How to know when numbers deceive you. New York (Simon & Schuster). [British edition: Reckoning with risk. London (Penguin).]
Greiser, E. (2009), Leukämie-Erkrankungen bei Kindern und Jugendlichen in der Umgebung von Kernkraftwerken in fünf Ländern. Report prepared for the political party Bündnis 90/Grüne, see http://www.gruene-bundestag.de/cms/archiv/dokbiny302/302113.studie_leukaemierisiko.pdf
Haller, H., S. Krauss (2002), Misinterpretation of significance: A problem students share with their teachers? Methods of Psychological Research Online 7: 1-20.
Heasman, M.A., J.D. Urquhart, R.J. Black, I.W. Kemp (1987), Leukemia in young persons in Scotland: a study of its geographical distribution and relationship to nuclear installations. Health Bulletin 45: 147-151.
Heinävaara, S., S. Toikkanen, K. Pasanen, P.K. Verkasalo, P. Kurttio, A. Auvinen (2010), Cancer incidence in the vicinity of Finnish nuclear power plants: an emphasis on childhood leukaemia. Cancer Causes and Control 21: 587-595.
Hoffmann, W., H. Kuni, H. Ziggel (1996), Leukämiesterblichkeit in der Nähe von japanischen Atomkraftwerken doch erhöht. Strahlentelex 238: 2-5.
Hoffmann, W., C. Terschueren, D.B. Richardson (2007), Childhood leukemia in the vicinity of the Geesthacht nuclear establishments near Hamburg, Germany. Environmental Health Perspectives 115: 947-952.
Iwasaki, T., K. Nishizawa, M. Murata (1995), Leukaemia and lymphoma mortality in the vicinity of nuclear power stations in Japan. Journal of Radiological Protection 15: 271-288.
Kaatsch, P., C. Spix, R. Schulze-Rath, S. Schmiedel, M. Blettner (2008a), Leukaemia in young children living in the vicinity of German nuclear power plants. International Journal of Cancer 122: 721-726.
Kaatsch, P., C. Spix, I. Jung, M. Blettner (2008b), Childhood leukemia in the vicinity of nuclear power plants in Germany. Deutsches Ärzteblatt International 105: 725-732.
Kinlen, L.J. (1995), Epidemiological evidence for an infective basis in childhood leukaemia. British Journal of Cancer 71: 1-5.

"True Believers" or Numerical Terrorism at the Nuclear Power Plant · 619

Kinlen, L.J., R. Doll (2004), Population mixing and childhood leukemia: Fallon and other US clusters. British Journal of Cancer 91: 1-3.
Körblein, A., W. Hoffmann (1999), Childhood cancer in the vicinity of German nuclear power plants. Medicine & Global Survival 6: 18-23.
Krämer, W. (2008), Denkste - Trugschlüsse aus der Welt des Zufalls und der Zahlen. 8th paperback edition, München: Piper.
Krämer, W. (2010), So lügt man mit Statistik. 10th paperback edition, München: Piper.
Krämer, W., R. Runde (1992), The holiday effect: yet another capital market anomaly? Pp. 453-462 in: S. Schach, G. Trenkler (eds.), Data analysis and statistical inference: Festschrift in honour of Friedhelm Eicker. Bergisch-Gladbach: Eul-Verlag.
Krämer, W., G. Gigerenzer (2005), How to confuse with statistics. The use and misuse of conditional probabilities. Statistical Science 20: 223-230.
Laurier, D., L. Bard (1999), Epidemiologic studies of leukemia among persons under 25 years of age living near nuclear sites. Epidemiologic Reviews 21: 188-206.
Lovell, M.C. (1983), Data mining. Review of Economics and Statistics 65: 1-12.
McCloskey, D. (1983), The rhetoric of economics. Journal of Economic Literature 21: 481-517.
McCloskey, D. (2002), The Secret Sins of Economics. New York: Wiley.
Möhner, M., R. Stabenow (1993), Childhood malignancies around nuclear installations in the former GDR. Medizinische Forschung 6: 59-67.
Michaelis, J., B. Keller, G. Haaf, P. Kaatsch (1992), Incidence of childhood malignancies in the vicinity of West German nuclear power plants. Cancer Causes Control 3: 255-263.
Miller, G.A., R. Buckhout (1973), Psychology: The science of mental life. New York: Harper & Row.
Nunally, J.C. (1975), Introduction to statistics for psychology and education. New York: McGraw-Hill.
Reichmuth, A. (2010), Angstmacherei mit Atomkraft. Die Weltwoche Nr. 16, 22. April.
Ries, L.A.G., M.A. Smith, J.G. Gurney, M. Linet, T. Tamra, J.L. Young, G.R. Bunin (1999), Cancer Incidence and Survival among Children and Adolescents: United States SEER Program 1975-1995. National Cancer Institute, Bethesda, MD.
Schuchard-Fischer, C., K. Backhaus, H. Hummel, W. Lohrberg, W. Plinke, W. Schreiner (1982), Multivariate Analysemethoden - Eine anwendungsorientierte Einführung. 2nd edition, Berlin: Springer.
Sharp, L., R.J. Black, E.F. Harkness, P.A. McKinney (1996), Incidence of childhood leukaemia and non-Hodgkin's lymphoma in the vicinity of nuclear sites in Scotland. Occupational and Environmental Medicine 53: 823-831.
Sterling, T.D. (1959), Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. Journal of the American Statistical Association 54: 30-34.
Stern, J.M., R.J. Simes (1997), Publication bias: evidence of delayed publication in a cohort study of clinical research projects. British Medical Journal 315: 640-645.
Sofer, T., J.R. Goldsmith, I. Nusselder, L. Katz (1991), Geographic and temporal trends of childhood leukaemia in relation to the nuclear power plant in the Negev. Public Health Review 19: 191-198.
Swensen, A.P., J.A. Ross, X.O. Shu, G.H. Reaman, M. Steinbuch, L.L. Robison (2001), Pet ownership and childhood acute leukaemia. Cancer Causes and Control 12: 301-303.
Taubes, G. (1995), Epidemiology faces its limits. Science 269: 164-169.
Waller, L.A., B.W. Turnbull, G. Gustafsson, U. Hjalmars, B. Andersson (1995), Detection and assessment of clusters of disease: an application to nuclear power plant facilities and childhood leukaemia in Sweden. Statistics in Medicine 14: 3-16.
White-Koning, M.L., D. Hémon, D. Laurier, M. Tirmarche, E. Jougla, A. Goubin, J. Clavel (2004), Incidence of childhood leukaemia in the vicinity of nuclear sites in France, 1990-1998. British Journal of Cancer 91: 916-922.
Wyss, W. (1991), Marktforschung von A-Z. Lucerne: Demascope.
Ziliak, S., D. McCloskey (2008), The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives. Ann Arbor: University of Michigan Press.


Prof. Dr. Walter Krämer, Institut für Wirtschafts- und Sozialstatistik, Fakultät Statistik, Technische Universität Dortmund, 44221 Dortmund, Germany. [email protected]

Univ.-Prof. Dr. Gerhard Arminger, Lehrstuhl für Wirtschaftsstatistik, Fachbereich B Wirtschaftswissenschaft, Schumpeter School of Business and Economics, Bergische Universität Wuppertal, Gaußstrasse 20, 42097 Wuppertal, Germany. [email protected]


One-eyed Epidemiologic Dummies at Nuclear Power Plants

A Reply to Walter Krämer and Gerhard Arminger's Paper '"True Believers" or Numerical Terrorism at the Nuclear Power Plant'

By Eberhard Greiser, Musweiler

JEL I12; C19
Meta-analysis; childhood leukaemia; nuclear power plants; misconception of epidemiology.

Summary

Krämer and Arminger, in a preceding article in this volume, insinuate that in a meta-analysis on childhood leukaemia in the vicinity of nuclear power plants (NPP) in five countries gross methodological errors had led to falsified statistics. Their major assumptions were a) arbitrary exclusion of publications with nil results, and b) publication bias in the conduct of the meta-analysis. It is demonstrated that all appropriate publications providing data on incident cases of leukaemia and on the underlying population, or rates of incidence with confidence intervals, had been included. In addition, it is demonstrated that all publications excluded from the meta-analysis either did not provide sufficient data on NPPs or that their cases had already been included in the meta-analysis via other publications.

In 2009 I was commissioned by the parliamentary group of the Green Party in the Bundestag (German Federal Parliament) to conduct a meta-analysis on the possible correlation of living in the vicinity of a nuclear power plant and leukaemia. To do so I researched the scientific literature for publications that provided, for individual nuclear power plants, numbers of incident cases of leukaemia, especially in children. Moreover, I was able to get specific leukaemia data for US counties with nuclear power plants both from a nation-wide cancer registry project (SEER), supervised by the US National Cancer Institute, and from several US state cancer registries. These data were used to calculate relative rates for incidence, applying national incidence rates as reference. The principle of a meta-analysis is to calculate weighted pooled incidence rates.

Walter Krämer took the opportunity to comment on my meta-analysis1 in the German newspaper "Die Welt" on September 8, 20092, four days after it had been presented at a press conference in Berlin, insinuating that on purpose only those nuclear power plants with increased relative incidence rates had been selected for the meta-analysis, discarding all others. The article by Walter Krämer and Gerhard Arminger is a kind of sequel to that commentary.

The authors (Krämer and Arminger) in their article now claim that several errors could have led to the conclusion that the incidence of childhood leukaemia is increased in the vicinity of nuclear power plants. The most damaging of these errors - according to Krämer and Arminger - are

• Publication bias, as they understand it; and
• Disregard for covariates and inclusion of outliers.

They also state that it should be inappropriate to start an analytic study where beforehand the suspicion of a causal association between leukaemia and emissions from a nuclear power plant had been raised after an incidence study has shown that an increased incidence of leukaemia existed in the immediate vicinity of the power plant. Furthermore, they adopt the claim by Taubes that "epidemiological evidence of any sort should only be taken seriously if there is at least a twofold increase in the risk observed".3

All these statements demonstrate a fundamental misunderstanding of epidemiological principles and results, although such misconceptions are to be expected from non-epidemiologists. However, as Krämer and Arminger are intervening in an epidemiologic terrain which is not within their primary field of expertise, they should at least consider the simple logic which is the basis of epidemiologic thinking. To deconstruct Taubes' claim, one should remember what leads to significant risk increases in epidemiology. There are two essential prerequisites that have to be met:

1. That there is a causal relationship between a risk factor and a disease outcome; and
2. That there is a sufficient number of observed patients within one or several studies.

In fact, if a risk increase much smaller than 100 % (equivalent to an odds ratio of 2.0) is based on a large number of diseased persons, a risk increase of 20 % could be significant and could in fact be of major importance for public health (see the sketch below). Therefore, if Taubes' position were taken seriously, most of the results of environmental epidemiology, as well as occupational epidemiology, would not be worthy of discussion and certainly would not lead to regulatory interventions.

A prominent example demonstrating the absurdity of Taubes' claim is the causative role of environmental tobacco smoke in the development of cancer. This issue has been the subject of intensive research and of heated discussion, mainly between epidemiologists and lobby groups (and their scientific proponents), over several decades in the last century. Nowadays, nobody doubts that such causation exists. The International Agency for Research on Cancer (IARC), which is part of the World Health Organization, has spent decades publishing scientific monographs reviewing the pertinent scientific evidence on the factors that could lead to an increased risk of cancer. Its 2002 monograph on environmental tobacco smoke4 states that the excess risk for the development of lung cancer in non-smokers after being exposed to tobacco smoke from smoking spouses or co-workers comes to 20 % in women and to 30 % in men, figures way below the 100 % which Taubes - and his followers Krämer and Arminger - demand as a cutoff point for taking an excess risk as worthy of further discussion and investigation.

1 See Greiser (2009).
2 See Krämer (2009).
3 See Taubes (1995).
4 See IARC (2004).
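Prerequisite 2 can be illustrated with a minimal numerical sketch (ours, not part of the original reply), using the normal approximation to the Poisson distribution and invented case counts: the same 20 % excess risk is statistically invisible in a small study and overwhelming in a large one.

```python
# A fixed 20 % excess over the expected case count, at different study sizes.
from math import sqrt

for expected in (25, 100, 1000, 10_000):
    observed = 1.2 * expected               # 20 % excess in every scenario
    z = (observed - expected) / sqrt(expected)
    print(f"expected {expected:6d}  observed {observed:8.0f}  z = {z:6.2f}")
# expected 25    -> z = 1.00 (nowhere near significant)
# expected 1000  -> z = 6.32 (significant beyond any conventional level)
```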
A second grave misconception regarding epidemiologic reasoning is the ban on investigating a possible correlation of a presumed risk factor with a specific disease wherever the suspicion of such a correlation had already been raised, either by chance observation or by a properly conducted incidence study. Krämer and Arminger obviously regard such an investigation as scientific malfeasance, equivalent to a self-fulfilling prophecy. However, one might also argue - and it is a fundamental argument of epidemiologists - that to plan and to conduct an epidemiologic study there should be at least some indication of a possible risk increase. To illustrate this, let us take the argument of Krämer and Arminger back 150 years into history, to the time when an epidemic of cholera was decimating the population of London. The suspicion was raised that one of the causes of cholera could be polluted drinking water (and it has to be remembered that in 1850 nothing was known about bacteria). Following the Krämer-Arminger reasoning, John Snow should have refrained from investigating cholera in London, and should instead have resorted to, say, Edinburgh, where no such suspicion had been raised. Nowadays, the investigations of John Snow are the classic example of environmental epidemiology, the subject of introductory lectures for first-year students of epidemiology all over the world.

Furthermore, I would like to quote a personal experience which contradicts the "self-fulfilling prophecy" suspicion of Krämer and Arminger. Since about 1990, a cluster of childhood leukaemia has been observed by the German Childhood Cancer Registry in the immediate vicinity of the Krümmel nuclear power plant on the banks of the river Elbe near Hamburg. In 1992, there were seven children affected. Great public concern led to huge amounts of money being spent in attempts to determine all kinds of possible risk factors in these cases - without any scientific result. The author (E.G.) argued at that time that, if radioactive emissions were responsible for these seven cases of leukaemia in children below the age of 15, there could be incident cases of leukaemia in young adults, too. After lengthy discussions it was decided to follow a two-step approach: the first step was to investigate in an incidence study whether the incidence of leukaemia (and of malignant lymphoma) was increased in the population living within a 5-kilometre circle as compared to the population living in regions more distant from the nuclear power plant. The second step - conducting a case-control study with inclusion of all potential risk factors besides living in the vicinity of the nuclear power plant - was to start only if the results of the incidence study showed a significant increase within the 5-kilometre circle.

After completion of the incidence study, an increase of leukaemia risk within the 5-kilometre circle - for the author totally unexpected - emerged, and subsequently an argument similar to that employed by Krämer and Arminger was raised by interested parties to prohibit the implementation of the case-control study, as agreed beforehand. However, after two years of discussion, this case-control study went ahead in the spring of 1997. It finally comprised interviews of about 1,500 cases (= patients with leukaemia or malignant lymphoma) and about 3,000 controls (= persons without leukaemia or malignant lymphoma), drawn from the general population. It was at that time the largest case-control study conducted in Germany on this issue. The results, which were presented in the spring of 2004, were disappointing for the anti-nuclear activist groups, as it could not be shown that living in the vicinity of the Krümmel nuclear power plant (or of two other nuclear power plants in northern Germany) contributed to an increased risk of leukaemia. However, there was strong evidence that insecticides, herbicides and wood preservatives all had a causative role in leukaemia in adults.
The most prominent reproach that Krämer and Arminger level against the author's meta-analysis on childhood leukaemia is stated as follows: "The first and most prominent source of error is an understatement of the true size of tests of significance which results from the well known publication bias." Publication bias is a phenomenon that plays a prominent role in clinical research when, for example, the efficacy of pharmaceutical drugs is to be estimated using the method of meta-analysis to establish a weighted pooled effect measure over a series of different therapeutic trials.
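For readers unfamiliar with the mechanics, the following sketch shows one standard way such a weighted pooled effect measure is obtained, contrasting a fixed-effect estimate with a DerSimonian-Laird random-effects estimate. The study values are invented for illustration and are not the estimates underlying the meta-analysis under discussion:

```python
# Fixed-effect vs. DerSimonian-Laird random-effects pooling of log relative risks.
from math import exp

log_rr = [0.30, 0.05, -0.02, 0.25, 0.01]       # hypothetical study log-RRs
var    = [0.010, 0.020, 0.015, 0.008, 0.030]   # hypothetical variances

# Fixed effect: inverse-variance weighting.
w = [1 / v for v in var]
fixed = sum(wi * yi for wi, yi in zip(w, log_rr)) / sum(w)

# Between-study variance tau^2 (DerSimonian-Laird).
q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, log_rr))
c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
tau2 = max(0.0, (q - (len(log_rr) - 1)) / c)

# Random effects: heterogeneity flattens the weights and widens the interval.
w_re = [1 / (v + tau2) for v in var]
random_effects = sum(wi * yi for wi, yi in zip(w_re, log_rr)) / sum(w_re)

print(f"pooled RR, fixed effect:   {exp(fixed):.3f}")             # about 1.182 here
print(f"pooled RR, random effects: {exp(random_effects):.3f}")    # about 1.160 here
print(f"tau^2 = {tau2:.4f}")
```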


It is a well known fact that studies providing proof of efficacy are more likely to be published, as pharmaceutical companies are more likely to favour the publication of "positive" trials. It has been shown repeatedly that publications of therapeutic trials sponsored by pharmaceutical companies demonstrate larger effects than trials sponsored by national funding agencies, such as the US National Institutes of Health or comparable agencies in other countries. However, in order to detect a bias due to underreporting of the less favourable results from therapeutic trials, special procedures have been developed, including a graphical presentation of results named the forest plot.

The question now arises as to the kind of publication bias to be expected when it comes to unfavourable outcomes of possible emissions from nuclear power plants. Regarding vested interests, two major groups can be identified: nuclear power companies and regulatory agencies are interested in publications that show that there is no harm to the health of the population to be expected from the operation of nuclear power plants. In addition, anti-nuclear activist groups would probably be interested in the publication of studies showing the harmful effects of nuclear power plants. However, as the lobbies of the nuclear industry are much better organized than anti-nuclear activist groups, one should expect a bias in favour of more publications showing nil results. In the meta-analysis conducted by the author, there is just one publication included detailing harmful effects5, whereas six additional publications show no significant risk increase. In such a case, the author does not dare to state that, without a publication bias being present, one would have expected more publications displaying the harmful effects of emissions of nuclear power plants.

Krämer and Arminger do not criticize two aspects of the meta-analysis conducted by the author:

A. Included were publications which contributed data on single nuclear power plants as well as unpublished data provided by the US SEER database and by the cancer registries of three US states (Pennsylvania, Illinois, and Florida).
B. The meta-analysis combines data on individual nuclear power plants with the results of a German case-control study, which had been conducted in the vicinity of 16 nuclear power plants and provides an overall result only.

It could have been argued that such a mix of data from publications and from databases is unconventional and could contribute to bias, though not to a publication bias.

Furthermore, Krämer and Arminger propose excluding outliers, defined as regions with nuclear power plants, from the meta-analysis. They state, for example, that the region around the nuclear power plant San Onofre in California is distinct from other regions with nuclear power plants in so far as the proportion of persons of Hispanic origin is higher than in other regions and, as Hispanics are more prone to getting leukaemia, San Onofre should therefore have been excluded. However, it is a general rule in epidemiology not to purposely exclude outliers from studies. This would certainly encompass the intolerable danger of tailoring a study population to fit desired results. In addition, the proposed rule to exclude nuclear power plants with a prominent proportion of persons of Hispanic origin in the nearby area would also apply to the nuclear power plants of Diablo Canyon, also in southern California, and of Turkey Point in Florida, which Krämer and Arminger do not propose to exclude.

5 See Kaatsch et al. (2008a).


Krämer and Arminger further criticize the non-inclusion of several publications which in general claim that there is no increase in leukaemia risk in the vicinity of nuclear power plants (e.g. Laurier/Bard 19986). Perhaps they overlooked that Laurier and Bard published a review without providing data for individual nuclear power plants, whereas in the meta-analysis conducted by the author all data included (with the exception of the case-control study by Kaatsch et al.) were data from individual power plants. Moreover, perhaps Krämer and Arminger neglected to study in detail some publications they quote as examples which, in their opinion, should have been included in the meta-analysis:

a) Iwasaki et al.7 could not find an increased mortality rate around Japanese nuclear power plants. However, the author did not include data on mortality related to leukaemia in the meta-analysis, as mortality in no way reflects the incidence of leukaemia and thus is an inappropriate measure of this disease.
b) Michaelis et al.8 came to the conclusion that there was no excess risk in the vicinity of German nuclear power plants. As it is a general rule that the same data should not be included twice in a meta-analysis, this publication had to be omitted, as all of its data are included in the aforementioned case-control study published by Kaatsch et al. in 2008.
c) The publication by Kaatsch in the German Medical Journal9 is very interesting, as it states that there is no increased risk of leukaemia at all and thus contradicts Kaatsch et al.'s findings in the International Journal of Cancer, also published in 2008, where an increase of leukaemia in the 5-kilometre circle around nuclear power plants is correctly stated.
d) The publication by Waller et al.10 on leukaemia in the vicinity of four Swedish nuclear power plants does not provide any information on the underlying population or expected cases of leukaemia, data that are necessary for inclusion in a meta-analysis.
e) White-Koning et al.11 stated that they did not find any increase in childhood leukaemia rates around French nuclear power plants. This publication was deliberately excluded, because a later publication by Evrard et al.12 covered a larger time period (1990-2001) compared to the study by White-Koning et al. (1990-1998).
f) The publication by Heinävaara et al.13 is an interesting example of editorial neglect on the part of Krämer and Arminger, as the meta-analysis of the author was completed on September 1, 2009, whereas the work by Heinävaara et al. was published online on December 27, 2009. As the author has no crystal ball to look at publications which might be published in the future, he is at a loss as to how to comply with this reproach.

What has not been criticized by Krämer and Arminger is certainly an error of judgement: in conducting the meta-analysis, fixed-effect models were assumed. Analyses conducted after completion of the meta-analysis (September 1, 2009) using random-effects models showed greatly diminished increased risks. It cannot be decided whether the failure to report this error is due to Krämer and Arminger lacking familiarity with the methodology of meta-analyses or due to other causes.

The general critique of Krämer and Arminger referring to confounding risk factors is interesting, but goes beyond the method of meta-analysis: it would be worthwhile to conduct a multivariate regression including other area-specific variables besides those factors quoted by Arminger and Krämer. However, any such study would be an ecological study and would therefore have all the disadvantages of ecological studies.

It is disappointing that Krämer and Arminger did not refer in detail to the methods and results of the single case-control study that has been conducted so far to elucidate the association between childhood leukaemia and living in the vicinity of nuclear power plants. The so-called KiKK study combined two case-control studies: one study comprising all cases of childhood leukaemia reported to the German Childhood Cancer Registry from 1980 to 2003 and a second one comprising cases reported from 1993-2003. The first study used the exact distance to the nearest nuclear power plant as the single independent factor, whereas the second case-control study also employed a comprehensive questionnaire covering all known risk factors for childhood leukaemia. The final report of these studies14 unequivocally provides proof of an increased risk of childhood leukaemia at ages 0-5, up to a distance of 50 kilometres, rising with increasing proximity to the nearest nuclear power plant. Unfortunately, none of the researchers in other countries followed this design, which was proposed by the author during discussions of a design by a larger group of scientists, convened by the German Federal Office of Radiation Protection, in the year 2002.

In conclusion, after the incidents at the Fukushima nuclear power plant in Japan, the question of whether nuclear power plants are safe at all is obsolete. However, Krämer and Arminger should perhaps take to heart what the great Austrian philosopher Ludwig Wittgenstein15 wrote nearly a century ago: "Whereof one cannot speak, one must pass over in silence."

6 See Laurier and Bard (1998).
7 Iwasaki et al. (1998).
8 Michaelis et al. (1992).
9 Kaatsch et al. (2008b).
10 Waller et al. (1995).
11 White-Koning et al. (2004).
12 Evrard et al. (2006).
13 Heinävaara et al. (2010).
14 Kaatsch et al. (2007).
15 Wittgenstein (1922).

References

Evrard, A.S., D. Hémon, A. Morin, D. Laurier, M. Tirmarche, J.C. Backe, M. Chartier, J. Clavel (2006), Childhood leukaemia around French nuclear installations using geographic zoning based on gaseous discharge dose estimates. British Journal of Cancer 94: 1342-1347.
Greiser, E. (2009), Leukämie-Erkrankungen bei Kindern und Jugendlichen in der Umgebung von Kernkraftwerken in fünf Ländern. Meta-Analyse und Analyse. Im Auftrage der Bundestagsfraktion B'90/Die Grünen. 1. September 2009.
Heinävaara, S., S. Toikkanen, K. Pasanen, P.K. Verkasalo, P. Kurttio, A. Auvinen (2010), Cancer incidence in the vicinity of Finnish nuclear power plants: an emphasis on childhood leukaemia. Cancer Causes and Control 21: 587-595.
IARC (2004), Tobacco smoke and involuntary smoking. IARC Monographs on the Evaluation of Carcinogenic Risks to Humans, Vol. 83, Lyon.
Iwasaki, T., K. Nishizawa, M. Murata (1998), Leukaemia and lymphoma mortality in the vicinity of nuclear power stations in Japan. Journal of Radiological Protection 15: 947-952.
Kaatsch, P., C. Spix, S. Schmiedel, R. Schulze-Rath, A. Mergenthaler, M. Blettner (2007), Epidemiologische Studie zu Kinderkrebs in der Umgebung von Kernkraftwerken (KiKK-Studie). Abschlussbericht. Mainz.



Kaatsch, P., C. Spix, R. Schulze-Rath, S. Schmiedel, M. Blettner (2008a), Leukaemia in young children living in the vicinity of German nuclear power plants. International Journal of Cancer 122: 721-726.
Kaatsch, P., C. Spix, I. Jung, M. Blettner (2008b), Childhood leukaemia in the vicinity of German nuclear power plants. Deutsches Ärzteblatt International 105: 725-732.
Krämer, W. (2009), Wie Statistiken manipuliert werden. Kernkraft macht Schweißfüße. Die Welt, 8. September 2009. (How statistics are manipulated. Nuclear power induces sweaty feet.)
Laurier, D., L. Bard (1998), Epidemiologic studies of leukaemia among persons under 25 years of age living near nuclear sites. Epidemiologic Reviews 21: 188-206.
Michaelis, J., B. Keller, G. Haaf, P. Kaatsch (1992), Incidence of childhood malignancies in the vicinity of West German nuclear power plants. Cancer Causes Control 3: 255-263.
Taubes, G. (1995), Epidemiology faces its limits. Science 269: 164-169.
Waller, L.A., B.W. Turnbull, G. Gustafsson, U. Hjalmars, B. Andersson (1995), Detection and assessment of clusters of disease: an application to nuclear power plant facilities and childhood leukemia in Sweden. Statistics in Medicine 14: 3-16.
White-Koning, M.L., D. Hémon, D. Laurier, M. Tirmarche, E. Jougla, A. Goubin, J. Clavel (2004), Incidence of childhood leukaemia in the vicinity of nuclear sites in France, 1990-1998. British Journal of Cancer 91: 916-922.
Wittgenstein, L. (1922), Tractatus Logico-Philosophicus. With an Introduction by Bertrand Russell, F.R.S. New York: Harcourt, Brace & Company, Inc.

Eberhard Greiser, MD, PhD, Epi.Consult GmbH, Ortstr. 1 A, 54534 Musweiler, Germany; and Institute of Public Health and Nursing Research, Faculty of Health Sciences, Bremen University, Bremen. [email protected]


Are Most Published Research Findings False?

By Andreas Diekmann, Zurich*

JEL C12; C18
Tests of significance; statistical errors; replications.

Summary

In a provocative article, Ioannidis (2005) argues that, in disciplines employing statistical tests of significance, professional journals report more wrong than true significant results. This short note sketches the argument and explores under what conditions the assertion holds. The "positive predictive value" (PPV) is lower than ½ if the a priori probability of the truth of a hypothesis is low. However, computation of the PPV includes only significant results. If both significant and non-significant results are taken into account, the "total error ratio" (TER) will not exceed ½ provided no extremely large publication bias is present. Moreover, it is shown that theory-driven research may reduce the proportion of errors. Also, the role of replications is emphasized; replication studies of original research are so important because they drastically decrease the error ratio.

Typical research findings communicated by the mass media include the following: "Beautiful parents have more daughters than ugly parents" (Kanazawa 2006), curry increases the cognitive capacity of the brain (Tze-Pin Ng et al. 2006), medical doctors perform better in surgery if they have experience with video games (Rosser et al. 2007) or, to cite an example from economics, left-handedness in higher-educated males has a significant effect on the wage level (Ruebeck et al. 2006). It would not be difficult to continue this list. Moreover, there is no allegation that these studies are not based on a state-of-the-art statistical analysis of carefully collected data. Nevertheless, one is skeptical toward the results if the reported effects are new and have never been replicated. The same rule of caution should also apply to less spectacular findings of "significant" effects. Why?

A provocative article by Ioannidis (2005) aims to establish "why most published research findings are false". In the following piece, I will first sketch Ioannidis' argument. Then, it is asked under what conditions this assertion holds. Finally, some implications are discussed.

* I would like to thank Gabriela Koenig, Walter Krämer, Ben Jann, Matthias Naef, and Manuela Vieth for their valuable comments.


1 The probability that a significant effect is wrong

The core of the argument is similar to the well-known example of an HIV test with both a false-positive and a false-negative error rate of 1 percent. Let's assume 0.1 percent of the population is infected. By Bayesian reasoning, we can determine the conditional probability that a test result is wrong given a positive screening. We do not need a formula to compute the value. A numerical example suffices and is more illustrative. Assume we are testing a population of 100,000, of which 100 are HIV infected. Ninety-nine of these 100 HIV infections will be detected by the test. Of the remaining non-infected 99,900 individuals, 999 persons will be wrongly classified as infected. In total, there are 99 + 999 = 1098 persons with a positive screening result. The error ratio among the "positive" results is 999/1098 or 91 percent.

Now assume a 'population' of hypotheses. Some hypotheses, let's say 4 percent, are true. The α-error of a test of significance is 0.05. The power of the test, 1-β, is 0.80, i.e. the β-error is 0.20. The population consists of 10,000 hypotheses. Thus, 400 hypotheses are true and 320 will be detected by the test. Given the α-error, 480 of the remaining 9,600 hypotheses are wrongly classified as significant. The error ratio among the significant results is 480/800 or 60 percent (see also Tabarrok 2005).

More generally, let P denote the a priori probability of a hypothesis being true. The conditional probability that a hypothesis is true given that it passed the level of significance is called the "positive predictive value" (PPV). The error ratio among the significant findings is 1 - PPV. The PPV is derived easily. For a true alternative hypothesis, the probability that the outcome is significant (i.e. that the null hypothesis is correctly rejected) is (1-β)P. For a true null hypothesis, the probability of a significant result (i.e. that the null hypothesis is wrongly rejected) is α(1-P). Then, the PPV is1:

PPV = (1-β)P / [(1-β)P + α(1-P)] = 1 / [1 + (α/(1-β)) · ((1-P)/P)].

In accordance with one's intuition, the PPV decreases with both errors α and β and increases with the prior probability P. When α approximates zero, PPV converges to 1.
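The formula can be checked numerically; a minimal sketch (ours, not part of the original note) reproduces the 60 percent error ratio of the example above:

```python
# Positive predictive value: PPV = (1-beta)*P / ((1-beta)*P + alpha*(1-P)).

def ppv(alpha: float, beta: float, prior: float) -> float:
    true_pos = (1 - beta) * prior        # significant and really true
    false_pos = alpha * (1 - prior)      # significant although false
    return true_pos / (true_pos + false_pos)

p = ppv(alpha=0.05, beta=0.20, prior=0.04)
print(f"PPV = {p:.2f}, error ratio among significant results = {1 - p:.2f}")
# -> PPV = 0.40, error ratio = 0.60, i.e. the 480/800 of the example.
```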

2

When significance tests lead to more false than correct results

" M o s t published research findings are false" if the P P V is less t h a n V2. F o r P P V < V2 it follows (Ioannidis 2 0 0 5 ) : P / ( l - P ) < aJ{ 1-ß). F o r α = 0 . 0 5 and β = 0 . 2 0 , an a priori probability of Ρ < 1 / 1 7 leads t o a situation where m o r e t h a n 5 0 percent of all significant findings are false. There is nothing w r o n g with this a r g u m e n t . N o t e also t h a t there is n o assumption of a "file d r a w e r effect" o r " p u b l i c a t i o n b i a s " of non-significant results, because the derivations solely focus on significant effects 2 . T h e result follows even if all researchers carefully c o m p l y with the rules of statistical hypothesis testing! A possible high e r r o r r a t i o of ' n o r m a l ' science is alarming

1

2

See Ioannidis ( 2 0 0 5 ) for the derivation of the PPV and further results. Here, I have simplified the notation. See Auspurg and Hinz and Weiss and Wagner in this volume on publication bias.

630 · Andreas Diekmann

and seems to be a more serious problem than intentional data falsification. Of course, the error ratio of significant findings (1-PPV) will move further upwards if researchers manipulate the data to attain significant results (Ioannidis 2005). 3
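Solving the inequality for P makes the threshold explicit. A small sketch (again my own illustration under the parameter conventions above):

def critical_prior(alpha, beta):
    """Prior P below which the PPV falls under 1/2, i.e. P/(1-P) < alpha/(1-beta)."""
    r = alpha / (1 - beta)
    return r / (1 + r)  # solving P/(1-P) = r for P

print(critical_prior(0.05, 0.20))  # 0.0588... = 1/17, as stated above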

3 The a priori probability P

We know the α-error because this value is fixed by the researcher's decision; β depends on α, on the effect size and its variance, on the statistical properties of the test of significance, and on the sample size. But what about P? Ioannidis' example is the human genome. Some trait x is caused by a single gene. There are about 20,000 candidates for the monogenetic effect and, therefore, P is 1/20,000. However, in biomedical or social science research we usually do not deal with a finite population of hypotheses. The list of potentially explanatory variables (or their combinations) is infinite. With an infinite population of hypotheses, the logic of the argument does not change. A low prior probability P inevitably leads to a large error ratio or low PPV concerning the significant effects we read about in professional journals. Yet the problem remains that P is unknown. For most research fields, even a rough estimate of P is not available and, therefore, an estimate of the PPV is similarly unavailable. In a strict sense, then, Ioannidis' assertion ("most published research findings are false") can neither be verified nor falsified. Nevertheless, the Bayesian tautology is still useful. The Bayesian transformation of prior to posterior probabilities draws our attention to the possibly high error ratio of significant results. More important, the formulas have implications for research strategies aiming at the reduction of errors.

4 Non-significant results and the total error ratio

Another caveat refers to non-significant results, which are ignored in Ioannidis' analysis. His assertion that "most research findings are false" refers solely to published significant results. If non-significant results were included in the definition of the error ratio, it would decline substantially because most false hypotheses are identified as false. However, non-significant results are excluded by definition. Ioannidis (2005) defines a "research finding as [...] any relationship reaching formal statistical significance", although he acknowledges the importance of non-significant or "negative" results. The title of his article ("why most research findings are false") may be correct by definition but is also somewhat misleading. Let us reconsider the example regarding a population of hypotheses. Four hundred and eighty significant and 80 non-significant findings were false. Relate this number to the 10,000 hypotheses under test. The overall error ratio including the non-significant findings is only 5.6 percent. In general, the total error ratio (TER) is βP + α(1-P), provided there is no publication bias. For reasonable values of α and β, the TER will never be larger than ½.
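Stated as code, the TER is a one-liner; the sketch below (an illustration, not the paper's own material) reproduces the 5.6 percent of the example:

def ter(alpha, beta, p):
    """Total error ratio among all tested hypotheses, without publication bias."""
    return beta * p + alpha * (1 - p)  # false negatives plus false positives

print(ter(0.05, 0.20, 0.04))  # 0.056, i.e. 5.6 percent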

5 The TER and file drawer effect

Non-significant findings are often not published, as a result of the researcher's or the publisher's decision (the so-called "file drawer effect"). Due to publication bias, the overall error ratio of published results is much higher than is suggested by our example above. It may be possible that the error ratio 1 - PPV is larger than 50 percent. The question arises, then, whether the TER might exceed 50 percent as well if publication bias is present. Let θ denote the probability of a non-significant result being published. Then, the TER of published results is: TER = [θβP + α(1-P)]/[(1-β)P + θβP + α(1-P) + θ(1-α)(1-P)]. Without any publication bias (θ = 1), the TER of published results is βP + α(1-P) as before and, for β > α, TER increases with P in the interval [α, β]. In the opposite case of a total file drawer effect (θ = 0), i.e. only significant findings are published, it follows that TER = 1 - PPV. For small P and conventional values of α and β, TER increases with the publication bias (1-θ), while the PPV is independent of θ. The TER of published findings exceeds 50 percent only for a very small θ, i.e. a very large publication bias. From the equation above, it follows for TER > ½: θ < [α(1-P) - (1-β)P]/[(1-α)(1-P) - βP]. For small P, the relation converges to: θ < α/(1-α). Thus, for α = 0.05, only a very large publication bias (θ < 0.05) leads to a TER of more than 50 percent. While it is likely that, in some research fields, the error ratio among the significant findings (1-PPV) is larger than 50 percent, the TER is likely to be much lower even in the presence of publication bias³.

³ To combat publication bias, there are now journals exclusively reporting negative results. For example, the Journal of Negative Results in Biomedicine was founded in 2002. In the social sciences, working paper archives such as RePEc can be used to communicate negative results without possible censorship by journal editors. Of course, these efforts are only small steps in solving the problem of publication bias.
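The dependence of the TER of published results on θ is easy to trace numerically. A minimal sketch of the equation above (the function name is mine; the parameter values repeat the running example):

def ter_published(alpha, beta, p, theta):
    """TER among published results; theta = publication probability of non-significant findings."""
    errors = theta * beta * p + alpha * (1 - p)
    published = ((1 - beta) * p + theta * beta * p
                 + alpha * (1 - p) + theta * (1 - alpha) * (1 - p))
    return errors / published

print(ter_published(0.05, 0.20, 0.04, 1.0))   # 0.056: no publication bias
print(ter_published(0.05, 0.20, 0.04, 0.0))   # 0.6 = 1 - PPV: total file drawer effect
print(ter_published(0.05, 0.20, 0.04, 0.01))  # ~0.54: TER > 1/2 only for a very small theta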

6 The importance of a good theory

Other things being equal, the PPV increases with the a priori probability P. This probability is low if a researcher employs an inductive strategy. Of course, there are valuable explorative studies. However, an explorative study should not use significance tests, or at least should not interpret the test statistics in the conventional way. Consider the study of "handedness and earnings" cited above. Ruebeck et al. (2006) report that "left-handed college-educated people earn 15 percent more than right-handed college-educated people. This wage differential is found for males but not for females." The difference is significant, while many other variables and interaction effects in the study are not. There is no theory or ex-ante hypothesis suggesting an effect for this special subgroup. To be precise, Ruebeck et al. do not violate any norm of scientific conduct. Although they try to explain the results by ex-post reasoning, in a strict sense they do not do any HARKing ("hypothesizing after the results are known") as defined by Kerr (1998)⁴. The authors clearly describe the research procedure, they rightly recommend replicating their research to find out whether the results are robust and, in particular, they do not hide that there was no ex-ante hypothesis for this subgroup (pretending there was an a priori hypothesis is part of Kerr's definition of HARKing). Note that Ruebeck et al. do not report a significant main effect but present a second-order interaction effect of sex, education and handedness on earnings. By employing simple combinatorics, it is easy to demonstrate that even a small set of variables may lead to a huge number of first-, second-, and higher-order interactions (see the sketch below). Exaggerating somewhat, finding a significant effect for one or the other subgroup is no more surprising than the observation that every week some people are the lucky winners in a state lottery. An inductive strategy, which is searching for main and interaction "effects", implicitly assumes an extremely low a priori probability for a specific effect. This in turn implies an error ratio (1-PPV) of almost 100 percent! On the other hand, a deductive theory that is based on well-confirmed assumptions gives a higher likelihood to hypotheses that follow from the theoretical assumptions. A good theory is a filter to separate likely hypotheses from unlikely ones. Employing the inductive strategy, all genes would have the same a priori likelihood of producing a certain trait. However, a researcher guided by a theory will select a certain subset of likely candidates that are supposed to generate the observed effect. Genes of this subset have a higher a priori probability of producing the trait of interest than genes excluded by the theory. For instance, consider the curry study cited in the introduction. There are a huge number of spices or other types of food with nutritional properties which may or may not enhance cognitive abilities. Researchers trying to explore the relation between numerous kinds of food and the cognitive ability of consumers of that type of food will most likely produce numerous artifacts. Yet a theory based on the well-corroborated assumption that substance x will promote cognitive ability, plus the assumption that curry contains substance x, gives rise to the expectation that cognitive ability is enhanced by the intake of curry. The latter hypothesis has a larger probability than simply selecting a type of food by chance and exploring its impact on cognitive ability. The philosopher Karl Popper chose the metaphor of a "searchlight" (Popper 1979, Appendix 1). In a theory, we have a searchlight to select a certain subset of hypotheses. Theory-guided research is a means to increase the a priori probability P and to reduce the error ratio among published significant findings. This is an additional argument in support of the theory-driven deductive approach advocated by philosophers of science.

⁴ In particular, improper use of software for structural equation models such as LISREL or AMOS is prone to HARKing (Kerr 1998).
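The combinatorics mentioned above can be sketched in a few lines (an illustration of the argument, not the authors' own computation; the choice of ten variables is arbitrary):

from math import comb

k = 10                    # a modest number of candidate variables
two_way = comb(k, 2)      # 45 possible two-way interactions
three_way = comb(k, 3)    # 120 possible three-way interactions
print(two_way, three_way)
# At alpha = 0.05, chance alone yields about eight "significant" interactions:
print(0.05 * (two_way + three_way))  # 8.25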

7 The importance of replications

In the example regarding a population of hypotheses with parameters α = 0.05, β = 0.2 and P = 0.04, the error ratio of significant results was 60 percent. Only for the purpose of illustration, we further assume the population of hypotheses comprises 10,000 candidates. Recall that we have 800 significant findings, among them 320 true results and 480 false-positive ones. Now, assume a replication⁵ of all significant findings. The replication yields 256 findings which are significant and true and 24 findings which are significant and wrong. The PPV is 256/280 = 0.914 and the error ratio 1-PPV is 0.086.

⁵ Here, we are concerned with replication studies based on new data. Equally important are internal replications or "econometric audits" (Kane 1984), i.e. reanalyzing the data of the original study. Journals and authors ought to be obliged to give free access to well-documented data for re-analysis. Only a few journals have acceptable standards guaranteeing reanalysis (Freese 2007). The reanalysis often shows that the original results do not hold after critical inspection (Krämer et al. 1985; Dewald et al. 1986).

Thus, with a replication of significant findings, the error ratio decreases from 60 percent to 8.6 percent! Moreover, a large amount of error reduction by replication is not only hypothetical. Moonesinghe et al. (2007) give an illustrative example from genetics: "a survey of 600 positive associations between gene variants and common diseases showed that out of 166 reported associations studied three or more times, only six were replicated consistently". More generally, with every step n of a replication we observe an increase in the PPV: PPV(n+1) = (1-β)PPV(n)/[(1-β)PPV(n) + α(1-PPV(n))]. Consider that true hypotheses are among the non-significant results. We call the conditional probability that a hypothesis is true given a negative result the "negative predictive value" (NPV): NPV = Pβ/[Pβ + (1-P)(1-α)]. The NPV increases with α, β and the prior probability P. Not surprisingly, the PPV is larger than the NPV for all levels of P as long as (1-α)(1-β) > αβ. This relation holds for conventional values of α and β, and, of course, a test would not make much sense if this relation was not fulfilled. Since the PPV is usually larger than the NPV, replications of significant findings are more efficient than replications of negative results. Applying a test is - so to speak - like looking for gold nuggets in a pile of rubble. The 'significant pile' may contain a majority of rubble. Nevertheless, the proportion of nuggets is higher in the significant pile than in the pile of negative results. However, for both piles, repeated search increases the likelihood of finding the nuggets. Of course, replication strategies also depend on theoretical progress. A new theory might shed new light on a hypothesis that did not pass the level of significance in a former test. Or the former test may have been flawed because of inappropriate methods. Or the hypothesis is of great theoretical or practical value. In these cases, focusing on replications of non-significant results might pay off.
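The replication arithmetic can be checked with a short script; the following sketch (my own, based on the update formula above) iterates the PPV and evaluates the NPV of the running example:

def ppv_after_replication(alpha, beta, ppv):
    """PPV after one further replication of the significant findings."""
    hits = (1 - beta) * ppv
    return hits / (hits + alpha * (1 - ppv))

def npv(alpha, beta, p):
    """P(hypothesis is true | non-significant result), as defined above."""
    return p * beta / (p * beta + (1 - p) * (1 - alpha))

value = 0.40  # PPV after the first test in the example
for step in (1, 2):
    value = ppv_after_replication(0.05, 0.20, value)
    print(step, round(value, 3))        # 0.914 after one replication, 0.994 after two
print(round(npv(0.05, 0.20, 0.04), 3))  # 0.009: far fewer nuggets in the negative pile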

Although replications are the essence of a cumulative science, there is unfortunately little incentive to actually carry out replicative research. Replications contribute to the collective good of cumulative science, but free riding is the dominant strategy for most career-oriented researchers⁶. Of course, one cannot blame researchers for not devoting their time to less rewarding research activities. Moreover, editors are often biased against publishing replication studies, and it is very likely that publication bias exists regarding replications⁷. As with other social dilemmas or public-good problems such as climate change, overfishing or traffic jams, external regulations and "selective incentives" are required to promote the production of the public good of replications (see Auspurg/Hinz in this volume). There are several measures to promote and encourage replications:
- Stimulating awareness of the problems of significance testing and the importance of replications should be part of statistics and methods courses for students and researchers.
- Conducting replications should become a standard and obligatory part of academic education. For example, doctoral dissertations in economics are often cumulative, comprising several unpublished and published articles. Supervisors and dissertation regulations should encourage including a replication study in cumulative dissertations.
- Journals should demand that authors give access to well-documented data files. The best way is to host the documented data file on the server of the journal.
- Journals should include a replication section and they should encourage the submission of replication studies (as is the policy of the JES).
- Funding agencies should reserve a proportion of their budgets for supporting replication studies.
If we want to prevent a vast number of artifacts being accumulated in professional journals and textbooks, the incentive structure for conducting replicative studies in empirical sciences has to be changed by these or other measures. It is well known that researchers are less motivated to do replicative studies because original work is rewarded more. Note, however, that it is often possible to extend a replication by enlarging the design of an experiment, adding a new "treatment", considering an alternative method or including further variables in the analysis, and therefore to go above and beyond simply replicating the original study. Such studies are both replicative and original and, moreover, they have an important role to play in reducing the many errors in science.

⁶ See Kerr (1998) for HARKing as a "rational choice" given the incentive structure of the scientific profession. For applying the social-dilemma argument to publication bias, see Auspurg and Hinz in this volume.
⁷ For example, Diekmann and Przepiorka (2007) replicated an experiment conducted by Miller et al. (1998). Based on identity theory, Miller reported in the European Journal of Social Psychology that subjects having the same birthday elicit more cooperative behavior in a prisoner's dilemma game than subjects not sharing this characteristic. However, in our replication with a trust game and with the prisoner's dilemma we did not find any significant difference. The editor of the European Journal of Social Psychology refused to publish our replication. One can consider this a case of unethical behavior by an editor. The editorial decision clearly produces a publication bias.

References
Dewald, W.G., J.G. Thursby, R.G. Anderson (1986), Replication in Empirical Economics. The Journal of Money, Credit and Banking Project. The American Economic Review 76: 587-603.
Diekmann, A., W. Przepiorka (2007), Does Sharing the Birthday Really Increase Cooperation? Evidence From Replications. Mimeo: ETH Zurich.
Freese, J. (2007), Replication Standards for Quantitative Social Science. Sociological Methods and Research 36: 153-172.
Ioannidis, J.P.A. (2005), Why Most Published Research Findings are False. PLoS Medicine 2: 696-701.
Kanazawa, S. (2006), Beautiful Parents Have More Daughters. A Further Implication of the Generalized Trivers-Willard Hypothesis. Journal of Theoretical Biology 244: 133-140.
Kane, E.J. (1984), Why Journal Editors Should Encourage the Replication of Applied Econometric Research. Quarterly Journal of Business and Economics 23: 3-8.
Kerr, N.L. (1998), HARKing: Hypothesizing After the Results are Known. Personality and Social Psychology Review 2: 196-217.
Krämer, W., H. Sonnberger, J. Maurer, P. Havlik (1985), Diagnostic Checking in Practice. The Review of Economics and Statistics 67: 118-123.
Miller, D.T., J.S. Downs, D.A. Prentice (1998), Minimal Conditions for the Creation of a Unit Relationship: The Social Bond Between Birthdaymates. European Journal of Social Psychology 28: 475-481.


Moonesinghe, R., M.J. Khoury, A.C.J.W. Janssens (2007), Most Published Research Findings are False - But a Little Replication Goes a Long Way. PLoS Medicine 4: 218-221.
Ng, T.P., P.-C. Chiam, T. Lee, H.-C. Chua, L. Lim, E.-H. Kua (2006), Curry Consumption and Cognitive Function in the Elderly. American Journal of Epidemiology 164: 898-906.
Popper, K.R. (1979), The Bucket and the Searchlight: Two Theories of Knowledge. Pp. 341-361 in: K.R. Popper, Objective Knowledge. An Evolutionary Approach. Oxford.
Rosser, J.C. Jr., P.J. Lynch, L. Cuddihy, D.A. Gentile, J. Klonsky, R. Merrell (2007), The Impact of Video Games on Training Surgeons in the 21st Century. Arch. Surg. 142: 181-186.
Ruebeck, C.S., J.E. Harrington, Jr., R. Moffitt (2006), Handedness and Earnings. NBER Working Paper Series. National Bureau of Economic Research (NBER), Cambridge, MA.
Tabarrok, A. (2005), Why Most Published Research Findings are False. MarginalRevolution: http://marginalrevolution.com/marginalrevolution/2005/09/why_most_publis.html

Prof. Dr. Andreas Diekmann, ETH Zürich - Soziologie, CLU D 3, Clausiusstrasse 50, 8092 Zürich, Schweiz. [email protected]


What Fuels Publication Bias? Theoretical and Empirical Analyses of Risk Factors Using the Caliper Test

By Katrin Auspurg and Thomas Hinz, Konstanz*

JEL C10; C12; C18

Significance testing; publication bias; caliper test; rational choice; sociology of science.

Summary
Significance tests were originally developed to enable more objective evaluations of research results. Yet the strong orientation towards statistical significance encourages biased results, a phenomenon termed "publication bias". Publication bias occurs whenever the likelihood or time-lag of publication, the prominence, language or impact factor of the publishing journal, or the citation rate of studies depend on the direction and significance of research findings. Although there is much evidence concerning the existence of publication bias in all scientific disciplines, and although its detrimental consequences for the progress of the sciences have been known for a long time, all attempts to eliminate the bias have failed. The present article reviews the history and logic of significance testing, the state of research on publication bias, and existing practical recommendations. After demonstrating that more systematic research on the risk factors of publication bias is needed, the paper suggests two new directions for publication bias research. First, a more comprehensive theoretical model based on theories of rational choice and economics as well as on the sociology of science is sketched out. Publication bias is recognized as the outcome of a social dilemma that cannot be overcome by moral pleas alone. Second, detection methods for publication bias going beyond meta-analysis, ones that are more suitable for testing causal hypotheses, are discussed. In particular, the "caliper test" seems well suited for conducting theoretically motivated comparisons across heterogeneous research fields like sociology. Its potential is demonstrated by testing hypotheses on (a) the relevance of explicitly vs. implicitly stated research propositions and (b) the relevance of the number of authors for incidence rates of publication bias in 50 papers published in leading German sociology journals.

Does it make a difference whether the p-value of a statistical t-test is 0.051 or 0.049? In most cases only a marginal change in the data set, such as a small increase in case numbers, is needed to shift the estimated p-value below the 0.05 barrier, which is usually defined as the threshold for statistical "significance". Bearing this in mind, the sharp dichotomization between "significant" and "non-significant" results seems extremely problematic. In addition, there is no theoretical justification for the threshold: the 0.05 or 5 percent barrier in truth represents a social convention. Nevertheless, there is ample evidence that significant results and those fitting the theoretical expectations are much more likely to be published and cited than other results (see, for example, Dickersin 2005 or Wagner/Weiss in this volume for reviews).

* We would like to acknowledge the valuable comments made by Andreas Diekmann, Willi Nagl, Bernd Weiss, and two anonymous reviewers on an earlier version of this paper. Anja Joos, Cornelius Gross and Konstantin Mozer supported our data collection.

Whenever this preference for significant results is not justified by high scientific quality, the selective omission of other results causes a bias, called "publication bias", "publication selection bias" or "outcome bias" (Begg 1994; Begg/Berlin 1988). In the present article, we will define publication bias more broadly as the bias that occurs whenever the likelihood that research findings are submitted to journals or conferences, or the likelihood that they are accepted for publication, especially in journals with high impact factors, or cited by other authors, depends on the direction, strength or statistical significance of research findings (Møller/Jennions 2001 for a similar definition). Publication bias clearly has detrimental effects on the progress of science. In extreme cases, purely non-existent phenomena or artifacts are "supported" by empirical data (Diekmann in this volume; Fanelli 2010a). In less extreme cases, the bias leads to an overestimation of effect sizes and their precision. The phenomenon of publication bias has already been introduced by Wagner and Weiss in this volume.¹ Although many recommendations for preventing the bias are discussed (see section 1.4), there is still strong evidence for its occurrence in all disciplines using statistical significance tests, such as psychology (Coursol/Wagner 1986; Smart 1964; Sterling 1959), political science (Gerber/Malhotra 2008a), sociology (Gerber/Malhotra 2008b), educational research (Torgerson 2006), and economics (Ashenfelter et al. 1999; De Long/Lang 1992; Roberts/Stanley 2005; Stanley 2005; Stanley/Doucouliagos 2007; see also the special issue of the Journal of Economic Surveys 2005, Volume 19, 3). One possibility for dealing with the problem is to accept its occurrence and to focus on correcting it. There are meaningful tools for "ex-post" corrections within the frame of meta-analysis (see Wagner/Weiss in this volume), but unfortunately these methods rely on large case numbers of original studies and consequently do not work within research areas where there are only few studies on single issues, as is the case in all innovative research fields and many sub-disciplines of the social sciences such as sociology. In addition, the validity of meta-analyses relies on assumptions that are often not fulfilled. For these reasons, a more effective "ex-ante" prevention of the bias is preferable. There is another strong argument for more research on publication bias: as will be shown in more detail in the present article, scientists are more likely to engage in biased analyses than in fraud or deliberate falsification of data. Hence, publication bias constitutes the obviously more dominant, and thus more widespread, form of misconduct in statistical analyses (Fanelli 2010a; Feigenbaum/Levy 1996; Stephan 1996).²

¹ According to Dickersin (1990: 1385), the term "publication bias" was first used in the scientific literature by Mary Lee Smith in 1980 (Smith 1980). Another source of publication bias exists in reporting results that are significant or that fit the research hypotheses in an advocacy style. Considerations of publication bias date back to the ancient world, with one of the first analyses of publication bias being made by the Greek philosopher and agnostic Diagoras of Melos, who lived around 500 BC. According to legend, Diagoras was shown a votive temple on the island of Samothrace with portraits of those who successfully escaped from shipwreck. Asked if this was not strong proof that gods really existed and positively intervened in human affairs, Diagoras pointed to the missing portraits of those who prayed but nevertheless drowned (Scargle 2000: 94).
² There are some reasons for considering the bias itself as a kind of misconduct (e.g. Chalmers 1990). Following common definitions, the deception would have to be intentional to count the bias as misuse of statistics (e.g. Gardenier/Resnik 2002). However, it is still unclear how far publication bias is caused by the unconscious or intentional actions of scientists, and the idea of simple fraud will not help to overcome the problem. As the discussion in section 2 shows, not only authors but other actors like reviewers, editors, and even recipients contribute to the phenomenon.


The main concern of the present article is to suggest a new research agenda based on a comprehensive theoretical framework extending rational choice theory with concepts from sociology and the economics of science. The central point is that publication bias can be seen as the outcome of a social dilemma. This social dilemma cannot be overcome by relying on all actors behaving according to scientific norms by default. What is needed instead is a deeper investigation into how the incentive structures of involved actors influence the publication process. In addition, an innovative empirical approach, termed the "caliper test", which is not bound to meta-analyses, is introduced and for the first time applied to the analysis of causal assumptions.

1. Background and state of research

1.1 The logic and cult of statistical significance

The dominant approach in the social sciences for the analysis of theories by statistical methods is to specify one or more empirical hypotheses on parameter values in the population (for example, effect sizes or associations between variables) and to test these hypotheses against a random data sample. A "hybrid" of two different significance tests, proposed in the 1930s by Ronald A. Fisher (e.g. 1925) and by Jerzy Neyman and Egon S. Pearson (e.g. 1933), is usually applied to test these hypotheses (for more details: Gigerenzer et al. 1989; Gill 1999; Kline 2004). This hybrid consists of testing a "null-hypothesis" against a "research hypothesis". In the most common and simplest case, the null-hypothesis is specified as the parameter taking the value of zero in the population; the research hypothesis states that there is a non-zero parameter value.³ For instance, the null-hypothesis might state that there is no correlation (H0: ρ = 0) between two variables, while the research hypothesis might state that there is a (positive or negative) correlation (H1: ρ ≠ 0). Following Fisher, "significance values" consist in the conditional probability of observing values of the test statistics that are at least as extreme as those observed with the realized random sample. This probability is denoted as the p-value. Although not originally proposed by Fisher, it became common practice to use strict thresholds α to decide whether to reject or not reject the null-hypothesis (and so to accept the research hypothesis). Typically, threshold values of 10, 5, or 1 percent are defined, meaning that an estimated parameter value is only considered to be "statistically significant" when the calculated p-value < .1, .05 or .01. These significance levels represent the "type-I" error rate (the risk of falsely rejecting the null-hypothesis). To obtain the p-values, a statistical test value has first to be determined (t_obs). Usually, the test value is obtained by dividing the parameter estimate by its standard error (se). For instance, in the case of regression analyses, the test value (t_obs-reg) is determined by dividing the regression coefficient (β_reg) by its standard error, as shown in equation (1):

t_obs-reg = β_reg / se(β_reg)    (1)
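As a minimal illustration of equation (1) and of the two-sided p-value that the text goes on to formalize in equation (2) - the coefficient, standard error and degrees of freedom below are invented for demonstration, and scipy's t distribution is used for the tail probability:

from scipy.stats import t

def t_value(beta_reg, se):
    return beta_reg / se  # equation (1)

def p_two_sided(beta_reg, se, df):
    return 2 * t.sf(abs(t_value(beta_reg, se)), df)

print(p_two_sided(0.39, 0.20, 120))  # ~0.053: just "non-significant"
print(p_two_sided(0.41, 0.20, 120))  # ~0.042: "significant" at the 5 percent level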

In the case of two-sided significance testing (where the hypotheses state no special direction for the expected parameter value), the p-value represents the probability of obser-

³ The structure of hypotheses can obviously be more complex: there are hypotheses about specific functional forms, and often two or more parameters are considered simultaneously. Sometimes, the null hypothesis is "one-sided": this occurs when the parameter value is greater (or smaller) than a specific value.

ving an absolute test-value at least as extreme as the α-quantile of the sampling distribution of test values. This is formalized in equation (2), with t denoting the sampling distribution of test values and df the degrees of freedom:

p = P(t_obs-reg | H0 is true) = P(|t_obs-reg| ≥ |t(1-α/2),df|)    (2)

The significance thresholds α are only arbitrary levels, since there is no statistical justification for any threshold. Another weakness of the significance values is that they indicate the likelihood of only one of two possible kinds of errors that might occur given the assumption that the null-hypothesis is true. In addition to the type-I error (a "false positive rate" of accepting the research hypothesis), there is a type-II error (the "false negative rate" of wrongly accepting the null-hypothesis; see Diekmann in this volume for more details).⁴ Usually, one should be interested in minimizing type-I and type-II error rates at the same time. One of the main criticisms of the current practice of null-hypothesis significance testing is that it provides no statistical measure summarizing both rates of errors.⁵ Despite a pronounced skepticism about the "null ritual" (see, for example, Cohen 1994; Gill 1999; Kline 2004; Ziliak/McCloskey 2008), the described form of null-hypothesis significance testing - and, especially, the reliance on strict significance thresholds - has become increasingly popular (Gigerenzer et al. 1989; Leahey 2005). The reliance on strict significance thresholds seemed to remove all subjective evaluation from the process of inference. Moreover, the "null ritual" has provided different fields of research with a common language (Kline 2004: 8). Significance values are, however, less objective and standardized than is often assumed. As can be seen from the formulas, they strongly depend on the standard errors and the magnitude of effect sizes. The smaller the standard errors, and therefore in general the larger the number of observations, the lower the significance values are. Every effect size will be statistically significant if the sample size is increased enough. It is an illusion that significance testing involves no subjective evaluation of results (Gigerenzer et al. 1989: 288; Rosnow/Rosenthal 1989). Nevertheless, null-hypothesis significance (usually symbolized with one to three asterisks in tables) is regarded as one of the most important criteria when experts decide if results are important or not. After becoming established in top journals and used by authors from highly prestigious institutions like Harvard or Stanford, the practice diffused rapidly into all fields of the social sciences (Leahey 2005 for sociology). Meanwhile, more than 80 percent of articles published in leading journals in the fields of psychology and sociology eligible to use significance tests do so (Gigerenzer et al. 1989: 206; Leahey 2005: 12). In particular, this "cult of statistical significance" might have caused a special bias by impairing the validity of overall results. Significance values deliver valid information only in cases where all results, no matter if significant or not, have the same chance of being published. There is, however, strong evidence that results fitting theoretical assumptions and reaching statistical significance are much more likely to be written down, submitted to journals, and accepted for publication in (prestigious) journals than other results. The symptoms and consequences of such practices are obvious from equation (1). Due to sample variation, measured effect sizes always vary across different samples. But in cases of small true effects or small sample sizes (large standard errors), only inflated effect sizes reach statistical significance.⁶ If there is a strong publication bias, only the 5 percent of type-I errors that are found by chance are published, while the 95 percent of zero-findings are not (Rosenthal 1979: 638). As a result, an illusory (treatment) effect would be assumed to be occurring where in fact none exists. Moreover, even if there is a genuine effect beyond the bias, the magnitude of the estimated overall effect size may be massively exaggerated. Additionally, the precision of research findings in meta-analyses will be overestimated, leading to overconfident assessments of the validity of theories. In any case, there is a high risk of type-I errors being converted into theories (Kerr 1998; Labovitz 1972). Paradoxically, it is principally the definition of strict thresholds for "statistical significance", which is not justified by statistical theory, that stimulates the occurrence of publication bias.

⁴ The probability that the significance test correctly leads to the acceptance of the research hypothesis is termed the "power" of statistical tests; the power is the exact complement of the probability of a type-II error.
⁵ The lack of such a comprehensive indicator is historically explainable by the hybrid form of significance testing. The idea of type-I errors is based on the work of Fisher, while the idea of type-II errors is derived from the independently developed approach of Neyman and Pearson - an approach that originally represented a counter-proposal to Fisher (see, for example, Gigerenzer et al. 1989; Gill 1999; Kline 2004 for more details).

1.2 Sources of publication bias

So far, research on publication bias has mainly concentrated on detection and correction by statistical approaches. Generally, detection methods based on meta-analyses are applied (for example, funnel plots, trim-and-fill-methods, and correlation and regression

6

Because of the demonstrated statistical background, publication bias will be partly a function of sample size. Small studies will have an increased risk of publication bias (Begg/Berlin 1988).

What Fuels Publication Bias? · 641

analysis). These methods are introduced in detail in Wagner and Weiss in this volume, together with their most important findings. While meta-analyses offer appealing methods for the identification and correction of bias in large and homogeneous samples (the purpose they were originally developed for), their diagnostics have questionable validity in other settings. Recall that one of the main manifestations of publication bias is a negative association between effect and sample sizes; testing for such an association is the main idea of most approaches based on meta-analyses. Yet stronger effects in smaller studies do not necessarily stem from publication bias. They can also represent a genuine empirical effect, since, for instance, intervention programs might produce extraordinarily strong effects within small groups of high-risk individuals (Sterne et al. 2005: 89 for evidence in medical research). Therefore, a valid assessment of publication bias requires that all alternative explanations for the correlation between sample sizes and effect sizes are ruled out. Moreover, sufficient power to detect publication bias exists only with large numbers of original studies (at least 20 studies seems to be a rough guideline, but this is only true if the original studies resemble each other to a high degree; see, for example, Sterne et al. 2000 for simulation studies). Therefore, deep knowledge of the extent of publication bias exists only in scientific sub-fields where meta-analyses are applicable, which are mostly areas of economic, psychological and biomedical research (Ashenfelter et al. 1999; Dickersin, 2005; Stanley 2005; Stanley/Doucouliagos 2007). In other sub-fields of the social sciences, the knowledge hardly extends beyond the idea that there is presumably some bias, but with no understanding of its extent or impact. 7 In addition, meta-analyses-based indications of publication bias are usually not standardized and therefore not comparable across different meta-analyses. As a result, empirical knowledge of the risk factors for publication bias is still rare. The causes of the bias have been investigated by some psychologists. The main idea is that researchers, like all human beings, have an innate tendency to confirm their expectations and hypotheses (Fanelli 2010a: 2). Confirmations of one's own expectations and significant results might be used as proof by researchers, reviewers, editors and readers that the procedure and findings are sound (Rosnow/Rosenthal 1989). 8 There is already strong evidence from surveys and experiments that researchers consider significant outcomes as "better", more "valuable" or "publishable" than non-significant outcomes (Chase 1970; Fanelli 2010b: 5; Hedges 1992: 249; Rosnow/Rosenthal 1989; Skipper et al. 1967). However, publication bias is certainly not completely caused by unconscious behavior. It is probably to a much larger degree caused by the deliberate actions of authors, editors and reviewers. Some editors have directly stated that the statistical significance of results is a very important criterion for the acceptance of articles (Gigerenzer et al. 1989: 206). One attempt to identify the actors and conditions most susceptible for publication bias consisted in comparing different publication stages. Part of these studies followed the publication fate of a group of pre-registered studies in medicine or of abstracts submitted 7

8

According to a review article in sociology, on average only eight meta-analyses per year were conducted between 1975 and 2005 that are recorded in the Social Science Citation Index (SSCI)(Weiss/ Wagner 2008). Furthermore, these few meta-analyses have already been criticized for invalid results due to the level of (unobserved) heterogeneity across the original studies (Brüderl 2004; Sharpe 1997). The difference between a significant or non-significant finding may be interpreted as right versus wrong, or as success versus failure (Skipper et al. 1967).

642 · Katrin Auspurg and Thomas Hinz

to conferences (see, for example, Callaham et al. 1 9 9 8 ; Dickersin 1 9 9 0 ; Dickersin et al. 1992; Dickersin/Min 1 9 9 3 for medicine). Other studies compared the proportions of significant results across published and unpublished work (they used working papers or dissertations for the unpublished component) (Smart 1964 for psychology). Another strategy to track down unpublished work involved surveying prospective authors (Dickersin et al. 1 9 9 2 for medicine). Additionally, manuscripts submitted to journals were used as a sampling frame (Sahner 1 9 8 2 for manuscripts submitted to a German sociological journal). The results consistently indicated that - contrary to popular opinion publication bias primarily originated with the investigators themselves and not with the reviewers and editors (Dickersin et al. 1992: 3 7 4 ; Dwan et al. 2 0 0 8 ; Maller/Jennions 2 0 0 1 : 585). Another aspect studied was the time of publication. As was expected, significant results were mainly found to be published within shorter time-frames than other results, leading to a decline over time in the overall effect sizes (Dickersin 2 0 0 5 ; Dwan et al. 2 0 0 8 ; M0ller/Jennions2OOl: 383). Furthermore, there was evidence suggesting that non-native English-speakers invested the effort of reporting in English if the results had especially high chances of being published (Dickersin 2 0 0 5 ; Grégoire et al. 1995). In addition, primarily impressive and significant findings were submitted to high-impact journals (Baker/Jackson 2 0 0 6 ; Chalmers 1990; Dickersin 2 0 0 5 : 23). Daniele Fanelli (2010b) demonstrated that more competitive academic environments lead to a higher frequency of positive results. 9 In another study, he compared the outcomes of papers from different scientific disciplines registered in the Essential Science Indicators (ESI) database. The results confirmed his main assumption of higher incidence rates for publication bias in " s o f t " sciences (the social sciences) rather than in " h a r d " sciences (the natural sciences). This was expected because the former are characterized by the presence of less methodological consensus on testing research hypotheses. 1 0 Similarly, some authors have assumed or shown that the proportions of significant results declined with the number of hypotheses that were tested by single authors (Fanelli 2 0 1 0 a ; Hunter/Schmidt 2 0 0 4 ; Wilson et al. 1973). In addition, the source of funding was shown to matter. Several studies reported that the results of experiments funded by pharmaceutical industry favored new therapies more strongly compared to those experiments that were supported by public sources of funding (Dickersin 1 9 9 0 , 2 0 0 5 ) . These studies are conclusive for some risk factors (Begg/Berlin 1 9 8 8 ; Dickersin 2 0 0 5 for more comprehensive reviews). But, unfortunately, the observed patterns might be explained by other conditions than publication bias. Papers accepted versus papers non-accepted in peer-reviewed journals or accepted in journals with higher versus lower impact factors should naturally differ in scientific quality. Higher proportions of significant results might therefore only indicate better developed theoretical hypotheses or more exact measurement methods rather than publication bias. Similarly, the other group differences are confounded with issues of quality. 
Research institutions with more output per capita might be those that conduct studies of comparatively high quality (Fanelli 2 0 1 0 b ) ; non-publishing of dissertation results might be explainable by the low research expertise of their authors and not by their failing to demonstrate significant 9

10

This was done by comparing papers published in different US states. Across all scientific disciplines, papers more likely supported the tested hypotheses when they were published in states with a c o m paratively high output of working papers per capita (this feature served as a p r o x y for competitiveness), even when variables like the state's financial investments in research were controlled. For example, the odds of reporting positive results were found t o be about twice as large for papers in social sciences compared to those in physical sciences (Fanelli 2 0 1 0 a ) .

W h a t Fuels Publication Bias? · 643

results (Kulik/Kulik 1989: 273). Without objectively assessing the quality of research, the conclusions of all these group comparisons remain somewhat tentative. A further suggestion is that researchers conducting meta-analyses rate the quality of the original papers (Torgerson 2006: 95), but it seems very unlikely that meta-analysts accomplish a better quality assessment than the peer reviewer (Sharpe 1997: 887). 11 Another possibility for assessing the sources of publication bias consists in surveying researchers, reviewers and editors (Chase 1970; Hedges 1992: 249; Neuliep/Crandall 1990, 1993). This approach very likely suffers from social desirability bias. At the minimum, deliberate attempts to tweak significant results will be underreported. Experimental methods certainly provide more valid results. To our knowledge, there have been two field experiments in this area (see also Dickersin 2005: 17). These used similar versions of manuscripts with on the one hand positive (hypothesis-confirming) and on the other hand negative (hypothesis-rejecting) results (Epstein 1990; Mahoney 1977). Both studies found clear evidence for publication bias. Those experiments, however, are certainly not widely applicable. Another shortcoming of the existing research is the lack of a substantive theoretical framework. The existing studies mainly tested singular hypotheses which either were only vaguely derived from plausible arguments or were only justified by referring to some broad theoretical concepts. Without a clear theory-guided approach, the outcomes themselves are at risk of being biased (Diekmann in this volume; see Dubben/Beck-Bornholdt 2005 for research on publication bias in studies on publication bias). Testing single hypotheses always provides more possibilities to trigger desired effects than testing multiple hypotheses derived from one coherent theoretical framework (Hunter/Schmidt 2004). To sum up, there is a continuing lack of understanding of the causes of publication bias and their interrelatedness to institutional and context factors. As a consequence, it is difficult to create effective interventions that are not affected by unintended side-effects. 1.4 Shortcomings of practical recommendations

The most prominent ideas discussed so far are obliging all research hypotheses to be pre-registered prior to their investigation (as they are already in some fields of medical research) and establishing journals for negative results (i.e. journals that only accept non-significant findings or those contradicting the research hypotheses; Begg 1994; Begg/Berlin 1988; Callaham et al. 1998). Pre-registrations are presumably not effective in disciplines where investigators do not rely on their own costly data collections (like all studies based on secondary data), because in these research fields it is often not possible to verify what came first: the tests of hypotheses by existing data or their registration. Journals for negative results themselves have the counter-productive side-effect of explicitly favoring one kind of result. Another recommendation would be to launch a journal of replication studies to give replications more credit. Some authors proposed reviews that are results-blind (e.g. Sterling et al. 1995): different reviewers should assess the quality of different sections of manuscripts, so that emphasis could be put on the strength 11

The recommendation to correct meta-analysis by simply including all research ever conducted, i. e. also all working papers and other non-published studies, is affected by the same shortcoming in terms of quality problems (a criticism that has been vividly called "garbage in"; see, for instance, Sharpe 1997: 882). Furthermore, it would be virtually impossible to track down all research work ever done (Scargle 2000).

644 • Katrin Auspurg and Thomas Hinz of methodology and theoretical argumentation independent of the results. This procedure, however, threatens the quality of the whole peer review process, since reading only sections of papers does not make it possible to assess the rigor of the link between theory and empirical tests, opening the way to HARKing or ex-post-facto adjustments of significance levels. Another recommendation is to draw inferences mainly on results based on large sample sizes, since these are less prone to publication bias (Stanley et al. 2010 for the extreme recommendation to consider only the 10 percent reported results with smallest standard errors). This procedure has the problematic side-effect of blocking innovative research (which is typically conducted with smaller sample sizes) and of hampering research on small populations (such as people suffering from rare diseases or showing special kinds of deviant behavior). Additionally, the favoring of large samples could advantage actors with many research resources, thereby triggering the problematic Mathew-e.iie.ct in science (Merton 1968,1988). Also too short-sighted are recommendations to use impact factors or inclusions of all unpublished work to correct for publication bias in meta-analyses (see Baker/Jackson 2006; Torgerson 2006). Even if these procedures could effectively reduce the bias, the price of impairing overall results by giving heavy emphasis to studies of doubtful quality seems too high. And, moreover, most of these recommendations are restricted to meta-analyses. In other research areas, the advice so far hardly goes beyond appeals to follow the norms of good scientific conduct (see, for example, Chalmers 1990; Dickersin 1990). But despite the widespread teaching on the adverse consequences of publication bias and the fact that it is widely recognized that the proper use of statistical methods is a key element of research integrity (Gardenier/Resnik 2002), the bias still exists. We urgently need a more precise understanding of the motives of the diverse actors engaging in publication bias. The remainder of this article sketches out a research agenda for achieving this goal.

2

New impulses for publication bias research

2.1 Towards a more comprehensive theoretical framework The review on the existing research so far has showed that we need a more thorough theoretical framework explaining the occurrence of publication bias. Rational choice theory could build a promising starting point for such an endeavor, especially in its extension to game theoretical approaches that are explicitly targeted at modeling the inter-dependent behavior of strategic actors. The occurrence of publication bias could be considered a situation where the incentives of investigators and editors to achieve the maximum reputation and career success with a minimum of effort conflict with the interest of the public in the advancement of scientific progress. In terms of rational choice, such a situation would represent a social dilemma (Kerr 1998) that cannot be overcome by moral appeals for research integrity. It rather needs a profound analysis of the incentive structures of the diverse actors involved. The psychological tendency to seek expectation-confirming results could be integrated into such a framework. Additionally, theories from the sociology and economics of science provide useful concepts for understanding the reward structure in science and the behavior that it stimulates. In particular, theories about limited rationality due to information problems (like theories on statistical discrimination and signaling) already explain well other forms of bias (Auspurg et al. 2008 for some literature on science). In the following sections, a first outline of such a framework is presented. As a first step, the incentive structures of the various

What Fuels Publication Bias? · 645

actors involved in the publication process are analyzed. T h e second step is to explore how these incentive structures interact with each other. Finally, some concrete hypotheses on the conditions that fuel publication bias are derived. T h e basic idea is that rational researchers compete for sparse journal space and weigh the costs (time and research money) against the returns o f the publications (Feigenbaum/ Levy 1 9 9 3 ) . In the "publish or perish" environment in academia, publications are ever more important for fostering and sustaining scientific careers (see, for example, Graber et al. 2 0 0 8 for recent results on economics). T h e importance of a scientist's contribution is often measured by counting the citations of his or her work (Stephan 1 9 9 6 : 1 2 0 1 ) . Publications not only improve the visibility and reputation of investigators, but also represent an increasingly important prerequisite for faculty positions and research funding. In consequence, the rewards for publishing a paper are usually high (Feigenbaum/Levy 1 9 9 3 ) . Only a small proportion, however, of submitted manuscripts are accepted for publication. 1 2 An additional attribute of science is that new results especially are rewarded - reputation is primarily gained by the "priority of discovery" (Merton 1 9 5 7 , 1 9 6 8 , 1 9 8 8 ; Stephan 1 9 9 6 : 1 2 0 1 ) . In a heuristic view, the scientific market can be seen as a "winner-takes-all" contest, which is won by whoever is first with their results (Stephan 1 9 9 6 : 1 2 0 2 ) . For this reason too, authors will rush their results to a journal. Rational investigators will try to win the competition with a minimum expenditure of resources. In addition, they have to ensure that their work has no serious shortcomings that could jeopardize their reputation. For both purposes, they will try to anticipate which results are likely to be most impressive to reviewers and editors and how likely possible failures are to be detected. An editor, in contrast, is mainly interested in frequent citations of highly reputable articles, since frequent citations increase the journal's prestige and attract more readers, authors and subscribers (Hojat et al. 2 0 0 3 : 9 1 ) . To decide which papers to accept, he or she will try to anticipate the interest of readers (who will probably be more interested in new and impressive results) and rely on the quality rating of peer reviewers. In terms of the attention gained from citations, it is already known that significant and theory-confirming results are more often cited by other authors (Meller/Jennions 2 0 0 1 for an overview; Chalmers 1 9 9 0 ) . 1 3 It is very likely that editors would tend to favor those papers that attract most interest (Fanelli 2 0 1 0 b ) . In addition, editors are apparently also influenced by unconscious psychological perceptions: several surveys have demonstrated that they put more trust in significant than in insignificant results (see section 1.3). Reviewers make up the third group involved in the publication process: they are mostly responsible for rating the quality of manuscripts. Analyzing the incentive structure of reviewers again leads to the conclusion that the quality of published work is not maximized. T h e primary interest of reviewers is maximizing reputation while minimizing effort. Reviewing is certainly a time-consuming and labor-intensive task (see, for example, H o j a t et al. 2 0 0 3 : 86 for statistics on time requirements). 
In particular, with many papers submitted in very competitive research fields, reviewers have a high workload and have to evaluate papers of very similar quality, making it difficult to decide which piece of work is most worth publishing.

12 Hojat et al. report that, on average, just 27 percent of submitted manuscripts were accepted in recent years by the 25 journals of the American Psychological Association (APA) (Hojat et al. 2003: 75 et seq.).
13 Callaham et al. (2002), however, find no association between positive outcomes and citation rates.


An additional hurdle for reviewers (and readers) in assessing the true quality of papers is the problem of asymmetric information: reviewers inevitably have less insight into the research done than the original authors since, for instance, they often do not have access to the data used. Here, theories on rational behavior in situations of imperfect information apply, for example theories of statistical discrimination (Arrow 1973; Phelps 1972) or signaling (Spence 1973). These suggest that reviewers will solve the information problem by relying on proxies to indicate the quality of research work, including the status and reputation of authors (Auspurg et al. 2008). Similarly, one might expect that reviewers will use the strength and significance of results as a proxy for quality, especially if the author is not yet known in the research community. To put this in a somewhat extreme way: many asterisks indicating "significant" results might serve as a hint of meaningful work, and the confirmation instead of the rejection of theories might boost trust that all the empirical work was well done (Rosnow/Rosenthal 1989; Skipper et al. 1967).

Taken together, the main prediction is that - given the current system of science - the occurrence of publication bias is the very likely outcome of the interaction of all actors involved in the publication process (researchers, editors, reviewers and readers). Rational authors and editors have clear incentives to favor significant results. In addition, self-reinforcing feedback mechanisms are likely to be produced. If authors anticipate higher acceptance rates for papers with significant results, their obsession with producing and submitting results with many significance "stars" will be fostered even further. Of course, most scientists will also subscribe to ethical codes prohibiting biased analyses, and some scientists will be motivated mostly by a disinterested search for truth (Diamond 1996).14 But if the science market rewards biased results, researchers who primarily orient their work towards "significance" will survive. Selections of certain outcomes or manipulations of results will only be detected when reviewers criticize the violation of statistical assumptions or when other researchers fail to replicate the results. Both events are, however, very unlikely. Due to the problem of asymmetric information and time constraints, there is only a small risk that reviewers will detect the misuse of statistics. And due to the extremely high rewards for the priority of new results, replication studies are rarely done or published (Feigenbaum/Levy 1993).15

Such a situation represents a social dilemma (Kerr 1998: 213; Levy/Peart 2008; for general literature on dilemmas: Dawes 1980; Kollock 1998): the interaction of all actors leads to publication bias as the dominant outcome, a situation that deviates from the social optimum of unbiased and effective scientific progress and academic careers in compliance with ethical guidelines.16 Theoretical as well as empirical research demonstrates that social dilemmas cannot be overcome by moral appeals to respect scientific norms alone; they require supplementary shifts in the incentive structures (or "pay-offs") of all actors involved (see, for example, Dawes 1980; Kollock 1998 for comprehensive reviews). In other words, when most researchers engage in the selection of significant results, why should the single researcher handicap his or her hypotheses and analyses by not doing the same (for similar arguments: Mayer 1993: 142)?

14 Ethical codes meanwhile clearly forbid strategies promoting publication bias. See, for instance, the ethical guidelines of the American Statistical Association (ASA) (1999).
15 Statistics confirm that there are only very few replication studies in the social and behavioral sciences (Neuliep/Crandall 1990; Weiss/Wagner 2008; for statistics for economics: Mayer 1993: 146).
16 Social dilemmas are defined by two features: each individual has an incentive to choose an action not conforming to social cooperation; and all individuals are better off when cooperating instead of defecting (Dawes 1980: 69).


More concretely, the strategic structure concerning authors and editors fulfills each of the three prerequisites defining a multi-person prisoner's dilemma (Rapoport 1998). Each actor has two options: to fulfill the cooperative norms by treating all research results in an objective manner, or to give undue preference to significant results. Each actor has an incentive to choose the second, deficient option. As a result, the dominant outcome is a deficient equilibrium point: all are worse off than if all had chosen to cooperate - this applies to the scientific community as well as to the individual scientists. In terms of the behavior of reviewers, the situation mostly resembles a public good game: the overall outcome would be better if all actors invested more resources (i.e. reviewing time) in the public good of high-quality research, since this would increase the reputation of the journal and accordingly the reputation of reviewers; however, each anonymous reviewer is better off investing comparatively little time in reviewing others' work in order to increase the time available for his or her own research (see Alchian/Demsetz 1972; for public good problems within work teams: Ledyard 1995).

Moreover, there are sound reasons to assume that publication bias occurs much more frequently than deliberate falsifications or fabrications of data. The scientific community mainly expresses moral indignation and punishes obvious misconduct in cases of the misreporting or falsification of data. Careless or deceptive use of statistics is much less in focus (Feigenbaum/Levy 1996: 2; Gardenier/Resnik 2002). While the detection of fraud can damage a researcher's entire reputation, the detection of misleading statistics (or other biased analyses) will in general only undermine the acceptance of a single paper or publication. Paradoxically, the application of more serious measures to combat fraud has the unintended effect of encouraging more subtle methods of achieving the preferred outcomes. While the falsification of data can often be clearly proved, the picking of statistical models or the selective reporting of results is much more difficult to detect. Because the costs of being punished and the risks of being detected differ so greatly between fraud and the biasing of results, the dominant strategy clearly consists of methods that result in publication bias but fall short of data falsification (Feigenbaum/Levy 1996; Mayer 1993). Technical innovations, like the increasing computational power of statistical software, have further enlarged the possibilities for creating the desired results.

Hypotheses concerning the conditions that foster publication bias are straightforward. First, the amount of publication bias is expected to depend on the competitiveness of academic environments. The tougher the competition for journal space, the higher the incentives for publication bias. This applies all the more since, especially in highly competitive situations, reviewers and editors rely on quality proxies (Auspurg et al. 2008). Empirically, it is therefore to be expected that the extent of publication bias is larger in high-impact journals with high rejection rates. Similarly, publication bias should be more prevalent in international than in national journals. In addition, the risks of publication bias may well have increased over time. Both the pressure to publish and rejection rates for manuscripts have increased (Hojat et al. 2003).
Therefore, and also because of better technical possibilities for tweaking appealing results, the likelihood of publication bias is higher than in prior decades. Furthermore, the incidence rates should be inversely proportional to the number of hypotheses to be tested. When testing many different hypotheses within a single manuscript, the chance of being accepted for publication does not depend on one central effect. In addition, it is more difficult to generate consistently significant results than it is when testing a single hypothesis within a single data analysis (Hunter/Schmidt 2004: 496 et seq.). Similarly, the risk of publication bias should be higher for hypotheses that are


explicitly (not implicitly) stated (that is, if they are explicitly named as hypotheses, marked by common abbreviations like "H1" or "H2", or italicized or placed in bold in paragraphs). Results referring to such explicit statements of what the authors wish to test have a higher visibility, and their confirmation is more crucial for evaluating the quality of research during peer review. Thus, explicit hypotheses are expected to be more prone to publication bias than implicit ones. A corresponding assumption is that the extent of publication bias increases with the paradigmatic unity of a research field: "Where theory allows all positive results, we should find the empirical literature to be relatively free of selectivity" (Doucouliagos/Stanley 2009: 4). As a consequence, one could generally expect lower incidence rates of publication bias in the social sciences than in the natural sciences, since the former are typically characterized by a comparatively low standardization of theories and research methods. However, there are also variations within the natural sciences. The focus on null-hypothesis significance testing to evaluate results is especially widespread in bio-medicine, while other disciplines such as physics rarely rely on this practice to detect the "true theory" (Gigerenzer et al. 1989: 211). These considerations reinforce the argument that global hypotheses are not particularly instructive; it seems more promising to focus on the incentive structures of actors, which are probably influenced by more subtle context factors than broad research disciplines that integrate diverse research areas and different conventions for empirical work.

Hypotheses on the relationship with the career status of authors are more clear-cut. Junior researchers have less reputation to risk than senior researchers and therefore have less to fear if biased analyses are detected. At the same time, researchers - especially those in mid-career - need frequent publications to advance their academic careers. While entrance into the system of science is relatively easy, survival depends on reaching a critical amount of publications within a certain time period (Stephan 1996: 1224). The length and impact of publication lists might determine who is successful in gaining a professorship. According to the thesis of cumulative advantage in science (Merton 1968, 1988), less effort is needed to maintain a career that is already well established. The high general trust in the work of researchers with a high reputation might relieve them of the urgent need to find significant results in order to increase their chances of publication. The same prediction can be made from theories dealing with incomplete information (theories of signaling and statistical discrimination). Ideas from the sociology of science furthermore suggest that lower- and middle-ranked actors especially have to solidify their social standing by demonstrating conformity with strong paradigms (Phillips/Zuckerman 2011; Leahey 2005: 8).17 All in all, the lowest level of publication bias is therefore expected for researchers with permanent employment (like professors) and high reputation, while the highest will occur with scientists in mid-career. In other words, we expect an inverted U-shaped relationship between career age and incidence rates of publication bias.
It is also necessary to note that scientists will always try to prevent the detection of biased results by keeping some information on their work private or by withholding the data needed for replications (Stephan 1996). Incentives for misusing statistical analyses should increase in proportion to the cost of replications for other authors.

17 In addition, the authors with the highest reputations are probably the ones who are most likely to be the subject of replication studies, since work disproving the results of someone with a "name" is more likely to be published than work showing that an unknown researcher's results are wrong.


Publication bias can be concealed behind unavailable data or imprecise descriptions of research work. Moreover, since "being the first" and priority of recognition are especially rewarded in the scientific community, the few replication studies that are published will suffer from a special kind of publication bias - a bias towards unsuccessful replications of the original work (Feigenbaum/Levy 1993). Consequently, one has to assume that publication bias is more likely to occur when authors are working with "private" data, when there is high flexibility in the choice of statistical methods or operationalizations of theoretical concepts, and when the style guides of journals do not require detailed descriptions of the statistical methods and databases. These assumptions seem especially helpful for deriving effective interventions. Finally, for similar reasons, it seems likely that the incidence of HARKing or the misuse of statistical analyses is inversely proportional to the number of people involved in these processes. Researchers are likely to be aware that practices that push appealing results violate norms of good scientific conduct (see, for instance, the guidelines of the ASA, 1999), and they will therefore take steps to avoid detection. Yet the probability of deviant behavior being detected is likely to be proportional to the number of people involved. It is unlikely that large groups of researchers all share the same deviant norms and all trust each other to keep their improper behavior a secret. In consequence, the manuscripts most prone to publication bias should be those originating from single authors.

2.2 Empirical strategies for testing causal assumptions

The results referred to in section 1.3 are mainly in line with these theoretical predictions. But, as already indicated, the research so far suffers from testing only single hypotheses. A more profound analysis is needed that explicitly focuses on causal mechanisms. However, testing causal assumptions is complicated. In section 1.3, it was argued that common detection methods are not suited for testing causal assumptions. Meta-analyses rely on the assumption that the data stem from one population and all measure the same, or at least similar, empirical effects. In addition, they do not allow a standardized comparison of publication bias across different research areas; they thereby exclude the possibility of testing for the influence of different research environments and disciplines. Common methods beyond meta-analyses, like the comparison of proportions of significant results across different journals or research fields, are better suited for testing risk factors. But their results all suffer from the shortcoming of being confounded with aspects of quality. Higher numbers of significant results do not necessarily stem from publication bias but can instead indicate better theories, more precise research methods or a higher statistical power for the confirmation of hypotheses.

Fortunately, there is another method that allows a relatively unambiguous detection of the manipulation or picking of significant outcomes and is not restricted to single research fields. This is the method originally developed by Edward Tufte in an unpublished manuscript and taken up by Alan S. Gerber and Neil Malhotra, who called it the "caliper test" (see Gerber/Malhotra 2008a, b for details). The caliper test ("caliper" refers to a bandwidth around the thresholds of significance) is based on the intuitively appealing idea of checking whether there is an anomalously large number of test values just exceeding the threshold for statistical significance. The data used for the test consist of the test values of significance tests reported in original studies (e.g. the z- or t-values of regression coefficients published in social science journals). The test compares the number of observations in equal-sized intervals just above and below the critical threshold for significance.18


Under the null hypothesis that what is published is a random draw from the sampling distribution of all research, and in the case of small comparison intervals, the number k of observations just above the critical threshold should approximately follow a binomial distribution with a probability of 0.5 [i.e. k ~ B(N, 0.5)]. To state this more simply: the observations should be approximately distributed like the results of a coin flip; about half of them should be located over and about half of them under the threshold for statistical significance.19 Even if the caliper test is not applicable to single articles, it allows diagnostics for groups of articles not bound to special research areas. Figure 1 displays the results obtained for articles published in prominent American sociology journals (Gerber/Malhotra 2008b). Observations just above the critical z-value (1.96, marked by the dashed line) constitute the maximum of the frequency distribution of z-values, and there is a clear spike in the number of z-scores just above this threshold. There is another clear pattern indicating publication bias: the frequency appears to be low just under the critical value, suggesting that some observations have been omitted or shifted above the critical value. Caliper test statistics are presented to the right of the figure. Using a 5-percent caliper,20 44 observations fall into the "over" caliper and 12 into the "under" caliper (see the first line). Similarly imbalanced patterns are found for larger bandwidths. Tests for binomial distributions verify that these asymmetric patterns rarely occur by chance: the probability that this pattern would occur due to chance is less than 1 in 10 million (Gerber/Malhotra 2008b: 3; see the p-values indicated in the column to the right). Since the sharp line between "significant" and "non-significant" results is only a social convention, there are unlikely to be any alternative explanations for the high number of significant results except deliberate selections, including the suppression of "non-significant" results or manipulations of data analyses. Minor modifications to the test also allow detecting bias towards the expected sign of coefficients (Fung, no date).21 One shortcoming is the lack of an objective specification of the bandwidth of the comparison intervals. To avoid subjective interpretations (which may even be prone to a publication bias themselves), the robustness of results against alternative specifications has to be checked. Another disadvantage is the relatively high effort involved. To establish clear results, only the test values associated with the core hypotheses should be included in the test, meaning that the original studies have to be carefully examined to identify the most

18 For example, in the case of regression coefficients and two-sided significance testing at the 5-percent level, the numbers of absolute t-scores just below and above the critical value of 1.96 are compared.
19 The k observations are exactly binomially distributed with p = 0.5 if the real test value and the critical test value converge. The more the real t-score deviates from the critical value (that is, the value of 1.96 in the case of a two-sided test at the 5-percent level), the more the expected distribution becomes asymmetric, meaning that the distribution of the observations under the null hypothesis increasingly deviates from p = 0.5. It can be shown, however, that strong deviations are very unlikely and that, in the case of small bandwidths, the caliper test is very robust against violations of model assumptions (Gerber/Malhotra 2008b for more details).
20 Employing an x-percent caliper means that the number of t-statistics falling within the interval [z; z - (x-percent)z] is compared to the number of t-statistics falling into the interval [z; z + (x-percent)z].
21 The basic concept to be proved here is that there are more non-significant findings fitting the hypotheses than findings that run in the opposite direction. Additionally, it is possible to check whether an unusually high number of observations rejecting the hypotheses just fail to reach statistical significance (see again Fung, no date, for such an example).
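To make the mechanics of the test concrete, the following minimal sketch implements the caliper comparison in Python (illustrative code, not the authors' implementation; the interval convention follows footnote 20 and the binomial null distribution follows footnote 19):

```python
import math

def caliper_test(z_values, z_crit=1.96, caliper=0.05):
    """Count |z|-statistics in equal-sized intervals just above and just
    below the critical value and test the split against Binomial(n, 0.5)."""
    lower = z_crit * (1 - caliper)
    upper = z_crit * (1 + caliper)
    over = sum(1 for z in z_values if z_crit <= abs(z) < upper)
    under = sum(1 for z in z_values if lower <= abs(z) < z_crit)
    n = over + under
    # One-sided binomial probability of at least 'over' observations above
    # the threshold if publication were a fair coin flip around it
    p_value = sum(math.comb(n, k) for k in range(over, n + 1)) / 2 ** n
    return over, under, p_value
```

Applied to the 5-percent caliper counts of Figure 1 (44 observations over vs. 12 under), such a test rejects the coin-flip null hypothesis decisively.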


[Figure 1: Frequency distribution of the z-statistics published in prominent American sociology journals (Gerber/Malhotra 2008b), with caliper test statistics: observations over vs. under the critical value are 44 vs. 12, 73 vs. 33, and 115 vs. 42 for increasing caliper widths, each with the corresponding binomial p-value]

The probability P(perc) can be obtained directly from the proportion of bootstrap replications higher than the original estimate:

$$P(perc) = \operatorname{Prob}(\hat{\theta}^{*} > \hat{\theta}) = 1 - \frac{\#\{\hat{\theta}^{*}(b) < \hat{\theta}\}}{B} \qquad (5)$$
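A minimal sketch of this computation, shown for the first-digit distribution and under the assumption of a parametric bootstrap in which B clusters of the same size are drawn from the Benford distribution (illustrative code; the actual resampling scheme used in the paper may differ in detail):

```python
import numpy as np

DIGITS = np.arange(1, 10)
BENFORD1 = np.log10(1 + 1 / DIGITS)  # Benford probabilities for digits 1-9

def chi_square_vs_benford(first_digits):
    """Chi-square statistic of the observed first digits against Benford."""
    observed = np.bincount(first_digits, minlength=10)[1:]
    expected = BENFORD1 * len(first_digits)
    return ((observed - expected) ** 2 / expected).sum()

def p_perc(first_digits, B=10000, seed=42):
    """Plausibility of a cluster's fit: share of bootstrap chi-square values
    that exceed the observed one, as in equation (5)."""
    rng = np.random.default_rng(seed)
    first_digits = np.asarray(first_digits)
    chi_obs = chi_square_vs_benford(first_digits)
    reps = [chi_square_vs_benford(rng.choice(DIGITS, size=len(first_digits),
                                             p=BENFORD1)) for _ in range(B)]
    return float(np.mean(np.array(reps) > chi_obs))
```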

Using the percentile interval method, we can obtain probability values for the original chi-square values of the interviewer clusters that are independent of the size of the interviewer clusters. These probabilities reflect the plausibility of the fit to Benford, independent of the number of digits in the cluster. Our hypothesis is that cheating interviewers will have very low probabilities. Hence, it might be useful to construct interviewer rankings by means of these plausibility values.

5.2 Fit in interviewer clusters of sample A/B

The scatterplots in Figures 6 and 7 show the fit to Benford for the first digit and first two digit distributions in each interviewer cluster in samples A/B.8 The chi-square values of the clusters with detected fabricated interviews are marked with black circles. We can see that one of the four fabricated clusters has the worst fit to Benford and appears as an outlier in the case of the first digit distribution. In the first two digit distribution, three of the marked clusters have very high fit values. Figures 8 and 9 show the density distribution9 of the probability P(perc) in samples A/B, wave 1, for the first digit and first two digit distributions (normal density dotted line). If all interviewers were free from suspicion, P(perc) would only have values above 0.5 and the density function would ideally have a peak around P(perc) = 1.0. In our case, the highest density occurs at P(perc) = 0.94. We can also see a local maximum in the low probability region at P(perc) = 0.1. This means that there are a number of clusters with fit values of very low plausibility. One reason might be that these interviewers work in quite homogeneous sample points and/or that some of these interviewers fabricate their assignment and fail Benford's law. Table 1 shows the interviewer ranking by the probability P(perc) of each cluster for waves 1 to 3, sample A/B.

8 All scatterplots in this paper have been created using the software program TDA (Rohwer/Pötter 2005).
9 We again use a kernel density estimation with an Epanechnikov kernel.


[Figure 6: First digit distribution - chi-square values for the interviewer clusters in wave 1, sample A/B (first digit of 23 monetary variables per questionnaire; clusters with fakes are marked)]

[Figure 7: First two digit distribution - chi-square values for the interviewer clusters in wave 1, sample A/B (first and second digits of 23 monetary variables per questionnaire; clusters with fakes are marked)]

We can see that the fabricated cluster of the interviewer already identified as cheating, no. xx827x with 122 digits, has the lowest probability, P(perc) = 0.002, of all interviewers in wave 1. Overall, we find six additional interviewers who have probabilities below the 5 percent level. Of course, this is not a sure indication that these clusters are fabricated, but such low plausibilities of the realized chi-square values could be the result of cheating, and the fieldwork organization can use this information to re-contact households in suspicious interviewer clusters.


[Figure 8: Distribution of the plausibility of the fit in clusters - probability P(perc), sample A/B, wave 1, first digits]

[Figure 9: Distribution of the plausibility of the fit in clusters - probability P(perc), sample A/B, wave 1, first two digits]

Unfortunately, the other two fakes evident in wave 1 could not be identified with the first digit distribution. The cheating interviewer no. xx800x ranks 61st (P(perc) = 0.265) and interviewer no. xx937x even has a very high plausibility of 0.958 and rank 420 (not shown in the table). Nevertheless, if we use the first two digit Benford distribution, we find three of the four cheating interviewers in the top 12 of the ranking list, as shown in Schräpler (2010). This indicates that, in some cases, the first two digit distribution is more successful.


[Table 1: Interviewer ranking by the probability P(perc) of each cluster for waves 1 to 3, sample A/B]


5.3 Fit in interviewer clusters of sample C

We have shown in Figure 3 that Benford's law does not hold in wave 1 of the East German sample C. We found a strong disproportion among the lower digits, probably caused by homogeneous clusters with quite low monetary values. The homogeneity in the data is attributable to the living conditions in East Germany in 1989. If the overall fit in the sample is poor, we can reasonably assume that the fit for most clusters will be poor too. In line with this, in Figure 10 we find rather high chi-square values for the clusters in wave 1 of sample C (max. chi-square = 112.8; digits = 99).

[Figure 10: First digit distribution - chi-square values for the interviewer clusters in wave 1, sample C (first digit of 34 monetary variables per questionnaire)]

[Figure 11: First two digit distribution - chi-square values for the interviewer clusters in wave 1, sample C (first and second digits of 34 monetary variables per questionnaire)]

Despite all this, the fieldwork organization was unable to identify any cheating interviewers in this sample. Figures 12 and 13 show the density distribution of the probability P(perc) in sample C, wave 1 (normal density dotted line), for the first digit and first two digit distributions. The shape of the first digit density function is completely different from that in Figure 8. We find the highest density around 0.1 and a local maximum at 0.65. A naive interpreta-

[Figure 12: Distribution of the plausibility of the fit in clusters - probability P(perc), sample C, wave 1, first digits]

[Figure 13: Distribution of the plausibility of the fit in clusters - probability P(perc), sample C, wave 1, first two digits]



$$P(D_c = d_c) = \frac{1}{n}\sum_{i=1}^{k} n_{D_i}\, P(D_i = d_c) \qquad (3)$$

k is the maximum of the available significant digits, D_i is the i-th significant digit, D_c stands for the combined digits, n_{D_i} is the corresponding number of cases, and n represents the sum of the n_{D_i}. According to equation (3), our data in Table 2 result in a mixture of the first four significant digits, containing n = 5,481 cases. Those digits include all available values in the data. The combined Benford distribution is calculated by

$$P(D_c = d_c) = \frac{2{,}180}{5{,}481}\,P(D_1 = d_c) + \frac{1{,}779}{5{,}481}\,P(D_2 = d_c) + \frac{1{,}119}{5{,}481}\,P(D_3 = d_c) + \frac{403}{5{,}481}\,P(D_4 = d_c) \qquad (4)$$

for all possible values of the digits (d_c = 0, 1, ..., 9). Figure 2 shows the underlying distributions. The deviation from Benford's law is obviously very small - an impression which is supported by the χ²-goodness-of-fit test, which shows no significant deviation between the combined regression digits and the combined Benford distribution (χ² = 7.35, p = 0.60, df = 9).
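The combined distribution is straightforward to compute from the standard formula for the distribution of the i-th significant digit (Hill 1996). The sketch below is illustrative rather than the authors' code; the digit counts in the final line are the weights entering equation (4), with the third and fourth counts inferred from the digit percentages reported in section 5:

```python
import numpy as np

def benford_digit_probs(i):
    """P(D_i = d) for d = 0..9; the digit 0 cannot occur for i = 1."""
    if i == 1:
        return np.array([0.0] + [np.log10(1 + 1 / d) for d in range(1, 10)])
    k = np.arange(10 ** (i - 2), 10 ** (i - 1))
    return np.array([np.log10(1 + 1 / (10 * k + d)).sum() for d in range(10)])

def combined_benford(digit_counts):
    """Mixture of equation (3); digit_counts[i-1] is n_{D_i}."""
    n = sum(digit_counts)
    return sum(c * benford_digit_probs(i + 1)
               for i, c in enumerate(digit_counts)) / n

# Approximate weights for the journal data (n = 5,481 digits in total)
combined_probs = combined_benford([2180, 1779, 1119, 403])
```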

Figure 2 Comparison of the combined regression digit distribution and the combined Benford distribution (n = 5,481; data source: Kölner Zeitschrift für Soziologie und Sozialpsychologie 1985-2007), in percent:

digit                  0      1      2      3      4     5     6     7     8     9
combined regression  7.13  18.59  13.67  10.29   9.85  9.40  8.30  7.66  7.68  7.43
combined Benford     6.70  18.48  13.33  11.14   9.89  9.06  8.46  8.00  7.63  7.32


4 Checking experimental regression data

According to these results, regression digits as well as combined regression digits are Benford distributed. Two more requirements need to be fulfilled to detect fraud. First, the digit distribution of fraudulent coefficients must not follow Benford's law. Second, the difference between the Benford distribution and the fake distribution should be large enough to be detected by a significance test. The smaller the difference, the more coefficients are needed to distinguish real and fake regressions; and the more coefficients are needed, the less effective the method becomes. In this section we examine whether regression fraud deviates from the significant digit distribution. To get genuinely fraudulent data, we conducted an experiment with 47 students participating in a multivariate regression course at Ludwig-Maximilians-Universität München. They were asked to forge regression coefficients with four significant digits.4 The regressions had to support the thesis: the higher the education, the fewer cigarettes are consumed by smokers per day. Education was divided into three ascending categories with low education as the reference. Besides that, there were several other influences to fabricate.5 All in all, the students had to forge ten unstandardized regression coefficients, and they were asked to repeat this ten times. Every student was thus to create 100 coefficients; altogether there were 4,621 for all 47 persons. One student forged too few fourth significant digits to perform reliable tests - these fourth digits were excluded from the analysis. In the following, we always assume that the distribution of our student counterfeits corresponds to the distribution of real scientific falsifiers. This point might be questionable, but getting enough genuinely forged regression data is nearly impossible. Still, if the distribution of student forgers does not deviate much from the distribution of real forgers, the results indicate a tendency regarding the effectiveness of Benford's law.

Table 3 reports the findings of the χ²-goodness-of-fit test on the aggregated data of all students. The results show that every significant digit deviates statistically significantly from Benford's law. Consequently, we can assume that the digit distribution of our fraudsters is not Benford distributed. The lower p-values at the third and fourth digits also support Diekmann's proposal to analyze higher-order digits. Even the combined digits are associated with a smaller χ²-value than the fourth digit alone. But given that there were only few fourth significant digits in the Kölner Zeitschrift für Soziologie und Sozialpsychologie (18.5 percent), in practice it might be easier to get enough data using the combined method. However, this point requires further analysis. It is possible that aggregating and combining many different fraudulent distributions decreases the deviation from Benford's law. Therefore, we ran goodness-of-fit tests for each individual student and counted the identified forgers. This was done for single and combined significant digits. As a result of our experimental specification, we had about 100 individual regression coefficients per person to investigate. With this reduced number of cases, we expected that correct fraud identification might not always be possible.

4 The students participated in an advanced course of a Masters program. They had profound knowledge of multiple linear regression analysis as well as of the particular topic. Therefore, we can safely assume that they were capable of generating plausible and "real" fraudulent data.
5 Those were marriage, divorce, unemployment, net income, age, squared age, place of residence (divided into city and town), as well as the question whether there are children younger than fifteen living in the household.


Table 3 χ²-goodness-of-fit test for matching fraudulent data and Benford's law (df = 8 for the first digit, df = 9 for combined, second and later digits; data source: counterfeited student data)

                                      χ²       p
1st significant digit (n = 4,621)   103.39   0.00
2nd significant digit (n = 4,541)   122.59   0.00
3rd significant digit (n = 4,378)   304.90   0.00
4th significant digit (n = 3,866)   620.59   0.00
Combined digits (n = 17,406)        596.85   0.00

Table 4 Number of identified forgers with a χ²-goodness-of-fit test (n = 46 for the fourth digit and n = 47 for all others; data source: counterfeited student data)

                        absolute   percentage
1st significant digit      35        74.7
2nd significant digit      37        79.7
3rd significant digit      42        89.3
4th significant digit      41        89.1
Combined digits            47       100.0

Table 4 shows the sum of all absolute and relative realized deviations from Benford's law. The table gives evidence that, at the individual level, higher digits and especially combined digits improve fraud detection. Whereas with the first digit we were able to detect about 75 percent of the forged regressions, the third and fourth digits enabled about 89 percent identification. Analyzing all the digits together (about 400 per person), the combined method allowed detection of all 47 falsifiers.

5 Simulation of required cases to detect fraud

According to the evidence presented, fraudulent data do not follow the significant digit distribution. However, to establish an efficient instrument for detecting fraud, it is necessary to consider the required number of cases. To estimate the quantity of needed digits, a Monte Carlo simulation was conducted. After an explanation of our simulation method based on aggregated data, we will analyze individual fraud.

5.1 Aggregated fraud

To begin, several data sets with different numbers of cases were created. These new data sets contain randomly chosen digits, taken from the aggregated fabricated regression coefficients. This results in digit distributions for each data set, i.e. for many different quantities of fabricated digits. These enable us to estimate at what amount of data a χ²-goodness-of-fit test would very likely discover a significant deviation from Benford's law. The following is an exemplary illustration for the first significant digit with aggregated data. Table 5 represents the distribution of the first significant digit.

Table 5 Empirical distribution of forged first significant digits, in percent (data source: generated values based on the counterfeited student data)

first significant digit    1     2     3    4    5    6    7    8    9
percentage               30.9  17.3  13.5  8.0  7.2  4.8  5.3  6.1  6.7

Table 6 Recognized deviations depending on the number of first significant digits (data source: generated values based on the counterfeited student data)

generated number of cases   20  50  100  250  400  500  750  1,000  1,250  1,500
deviation                    0   0    0    0    1    0    0      1      1      1

A data set was generated by randomly choosing a specific number of cases from the empirical distribution of digits.6 This random process results in a simulated distribution representing the frequency of fraudulent digit values for a specific number of cases. Based on such a distribution, the goodness of fit to Benford's law can be analyzed by a χ²-test. Consequently, if there are not enough data to discover fraud, the test will not show a significant deviation. This procedure was repeated for increasing numbers of randomly chosen digits, giving a list of significant and non-significant deviations. If the result was significant, the value 1 was assigned; non-significant outcomes got the value 0, and the result was assigned to the analyzed number of cases. The respective binary vector indicates whether we are able to discover fraud when a specific amount of data is available. Table 6 shows such a vector as an example. In this example, first indications of fraud were discovered at 400 digits. We cannot presume to detect a fraudster with a high probability, because at 500 and 750 data points no significant deviations were found. For 1,000 and more cases, fraud was constantly spotted. The example shows results for ten χ²-tests; however, we actually repeated those simulations 1,000 times. Due to the random selection of digits, significant results are not necessarily followed by significant results at higher numbers of cases. To reduce these uncertainties, we ran a logistic regression model and estimated the number of digits needed in order to detect fraud with a probability of 95 percent. The logistic curve in Figure 3 plots the expected probability of discovering fraud for a specific number of observations. The necessary quantity of digits to detect fraud with a probability of 95 percent for the first significant digit corresponds to the point of intersection of the horizontal and vertical dotted lines; 1,045 cases were needed in our first data set. It has to be mentioned that this simulation is based on a strong assumption: the logistic regression is fitted on 100 percent fraudulent digits. In practice, regression coefficients are collected from different persons. Therefore, we assume that in real data the proportion of fraud is far lower.
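A sketch of one replication of this simulation step for the first significant digit (illustrative code, not the authors' implementation; the forged first-digit shares are taken from Table 5):

```python
import numpy as np
from scipy.stats import chisquare

DIGITS = np.arange(1, 10)
BENFORD1 = np.log10(1 + 1 / DIGITS)
FRAUD1 = np.array([30.9, 17.3, 13.5, 8.0, 7.2, 4.8, 5.3, 6.1, 6.7]) / 100

def deviation_detected(n_cases, share_fraud=1.0, alpha=0.05, rng=None):
    """Draw n_cases first digits from a fraud/Benford mixture and return 1
    if the chi-square test signals a significant deviation from Benford."""
    rng = rng or np.random.default_rng()
    mix = share_fraud * FRAUD1 / FRAUD1.sum() + (1 - share_fraud) * BENFORD1
    digits = rng.choice(DIGITS, size=n_cases, p=mix)
    observed = np.bincount(digits, minlength=10)[1:]
    return int(chisquare(observed, f_exp=BENFORD1 * n_cases).pvalue < alpha)

# Simulated detection rate at 400 purely fraudulent digits (cf. Table 6)
rate = np.mean([deviation_detected(400) for _ in range(1000)])
```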

6 For the combined method, the weights n_{D_1}/n ... n_{D_4}/n for P(D_1 = d_c) ... P(D_4 = d_c) were not chosen according to the percentage of first, second, third and fourth significant digits in the data, because real values have considerably fewer higher digits. Instead, the weights were set according to the percentage of significant digits in the Kölner Zeitschrift für Soziologie und Sozialpsychologie for the time period under consideration.


[Figure 3: Logistic regression for needed first significant digits to detect fraud, plotting the detection probability against the number of cases (aggregated data; data source: generated values based on the counterfeited student data)]
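The threshold read off the logistic curve in Figure 3 can be computed directly from the simulated detection outcomes. A sketch of this step, assuming the statsmodels package for the logistic fit (any logistic regression routine would serve):

```python
import numpy as np
import statsmodels.api as sm

def cases_for_95_percent(case_numbers, detections):
    """Fit detection (0/1) on the number of digits by logistic regression
    and solve for the count at which the detection probability is 0.95."""
    X = sm.add_constant(np.asarray(case_numbers, dtype=float))
    fit = sm.Logit(np.asarray(detections, dtype=float), X).fit(disp=0)
    b0, b1 = fit.params
    return (np.log(0.95 / 0.05) - b0) / b1
```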

To accommodate our estimation to a more realistic situation, we added different amounts of non-faked data. In this case, of the randomly taken digits only a certain percentage descended from our fabricated empirical distribution; the remaining share was chosen from the significant digit distribution. Based on this mixture of Benford-distributed and fraudulent data, the χ²-goodness-of-fit test examined whether fraud was detected. Logistic regressions were fitted to the partly fabricated data. This procedure was conducted for 100, 95, ..., 5, and 1 percent fraud. The logistic regressions estimate the required amount of digits for a specific percentage of fraud. To increase accuracy, the simulation was repeated 40 times. Altogether, this resulted in 800 logistic estimations for single and combined significant digits, containing the needed numbers of digits to identify deceptive data with a probability of 95 percent for different percentages of fraud. On this basis, we fitted linear regressions to get an empirical function for the required quantities subject to the proportion of fabrication in the data. Table 7 contains the constants and slope coefficients, based on the formula given in equation (5).

Table 7 Linear regression for required amount of significant digits to detect fraud with a probability of 95 percent and different proportions of fraudulent aggregated data (data source: generated values based on the counterfeited student data)

                        constant   slope coefficient     R²
1st significant digit     -210          1,208          0.935
2nd significant digit     -400          1,242          0.898
3rd significant digit      111            297          0.897
4th significant digit       62            136          0.954
Combined digits           -378          1,216          0.990

Note: All constants and slope coefficients have a p-value of 0.00.


Table 8 Estimated number of aggregated digits to detect fraud (data source: generated values based on the counterfeited student data)

                              Proportion of fraud
                       100%    75%     50%      25%      10%       5%
1st significant digit   998   1,788   3,853   13,456   67,720   228,281
2nd significant digit   842   1,512   2,112    9,536   38,875   110,687
3rd significant digit   408     639   1,299    4,863   29,811   118,911
4th significant digit   198     303     606    2,238   13,662    54,462
Combined digits         838   1,722   4,160   16,559   96,212   360,109

$$\text{Needed number of digits} = \text{constant} + \frac{\text{slope coefficient}}{(\text{proportion of fraudulent data})^{i}} \qquad (5)$$
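For illustration, the entries of Table 8 follow from the Table 7 coefficients via equation (5). For the first significant digit at 50 percent fraud (with the exponent i = 1.75 given below):

$$-210 + \frac{1{,}208}{0.5^{1.75}} \approx -210 + 1{,}208 \cdot 3.364 \approx 3{,}853,$$

in line with the corresponding entry of Table 8.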

where i, the power of the proportion of fraudulent data, is 1.75 for the first, 1.5 for the second, and 2 for the third and fourth significant digits; for the combined digits, i corresponds to 1.9. The range of values for the proportion is [0.01, 1]. Table 8 shows the amount of required digits to detect counterfeits in regressions, given different proportions of fraud. First, the numbers rise when the proportion of fabricated data decreases: the smaller the proportion of fraud, the more difficult fraud detection becomes. Second, fewer digits are needed at the third and fourth significant digits. A problem occurs, however, if we transfer these results to real data. Many regressions do not report a third or fourth significant digit. Few required higher significant digits do not necessarily mean that there are few coefficients to collect: if a regression reports no fourth significant digit, we have to gather another one. To get 100 fourth significant digits in the Kölner Zeitschrift für Soziologie und Sozialpsychologie, about 540 coefficients have to be collected. To enable a more realistic comparison concerning the effort of data collection, we estimate the number of coefficients and not the number of significant digits. This means that, in calculating the required coefficients, the needed amount of digits is weighted by the proportion of significant digits in real data. 81.6 percent of the regression coefficients from the Kölner Zeitschrift für Soziologie und Sozialpsychologie have a second significant digit, 51.3 percent have a third and 18.5 percent a fourth significant digit. If 100 second significant digits are needed, we expect that 100/0.816 = 123 coefficients have to be collected. For the combined digits, a different approach is necessary. Due to the fact that all coefficients have a first significant digit, the required amount of digits is weighted by the proportion of first significant digits in the analyzed articles. There were 5,481 digits in the non-fraudulent regressions containing 2,180 first significant digits, which is 38.7 percent. Hence, on average 100 combined digits are taken from 100 · 0.387 = 39 coefficients. Figure 4 shows the required amount of unstandardized regression coefficients for single and combined significant digits. We want to stress three results. First, looking at the single significant digits, digits of higher order give better evidence of fraud, i.e. the third significant digit. For example, at 50 percent fraud, we need 3,853 coefficients when analyzing the first significant digit, whereas 2,532 are needed when looking at the third significant digit. Second, there is an advantage in using the combined digits. For 50 percent fraud, the combined method requires 1,135 regression coefficients. This is about 1,400 coefficients fewer than the third digit. Using a combination of digits can reduce the number of necessary cases 1.5- to 3-times.7


[Figure 4: Required unstandardized regression coefficients to detect fraud with a probability of 95 percent, plotted against the proportion of fraudulent data for the 1st, 2nd, 3rd and 4th significant digits and the combined digits (aggregated data; data source: generated values based on the counterfeited student data)]

Third, the lower the percentage of fraud, the more difficult it becomes to detect. If we have 100 percent fraudulent data, the most efficient method, namely the combined digits, requires 379 regression coefficients. On average, the Kölner Zeitschrift für Soziologie und Sozialpsychologie had 44 regression coefficients in each article containing regression tables. For 379 coefficients, we expect that 9 articles have to be investigated. This is an amount of coefficients which can easily be gathered. For 50 percent fraud, we need 25 articles. If there is only 10 percent fraud, at least 575 articles should be analyzed, and for 5 percent fraud approximately 2,300 are necessary. Assuming that a value of 5 percent is closer to a realistic situation, we would need more than 100,000 regression coefficients. This result suggests that Benford's law is ineffective for investigating regression fraud with aggregated data if the percentage of fraudulent data is small.

5.1.1 Individual fraud

There is a possibility that aggregating data makes the investigation with Benford's law more difficult. If one fraudster avoids the digit 1 and another prefers this digit, the existing deviation decreases. This problem can be avoided by looking at regression coefficients from single persons. Therefore, we simulated the required quantity of cases for individual fraud. The procedure is nearly the same as for the aggregated data. At first, a specific number of digits was randomly chosen, in this case from the distribution of one single student. χ²-goodness-of-fit tests were used to check for the presence of significant deviations from Benford's law. These tests were conducted for all 47 student digit distributions.

7 This result changes if the percentage of fraud is very low. With less than about 10 percent fraudulent data, the second digit is more efficient for aggregated data, due to its lower power in the linear model. However, this is caused by an estimation error of the regression. Looking at the original data, the combined digits still have the advantage.


Table 9 Linear regression for required amount of significant digits to detect fraud with a probability of 95 percent by different proportions of fraudulent individual data (data source: generated values based on the counterfeited student data)

                        constant   slope coefficient     R²
1st significant digit      -23           152           0.998
2nd significant digit       -1           110           0.998
3rd significant digit        6           110           0.998
4th significant digit       -3            76           0.998
Combined digits            -49           223           0.998

Therefore, for a specific quantity of data, we gained 47 results telling us whether or not a fraud was identified. We repeated this for different amounts of randomly chosen digits and tested the goodness of fit. Following the approach used for the aggregated data, logistic regressions were fitted and the required numbers of cases calculated. The percentage of fraud was varied in order to estimate the amount of digits required to detect fraud with a probability of 95 percent. Lastly, linear regressions were fitted to these data subject to the proportion of fabrication. Table 9 shows the resulting estimation of the required digits to identify forged unstandardized regressions at the individual level, based on the formula

$$\text{Needed number of digits} = \text{constant} + \frac{\text{slope coefficient}}{(\text{proportion of fraudulent data})^{i}} \qquad (6)$$

where the power i of the proportion of fraudulent data is 2 for single as well as combined digits. All coefficients, except for the constants of the second and higher significant digits, showed a significant deviation from zero. Due to the very high R²-values of 0.998, this problem is not pursued further. We used the weights given previously in the aggregated calculations and computed the number of required unstandardized regression coefficients at the individual level. Unfortunately, only a small number of digits was available for each student. This posed a new problem: with student digit distributions based on only about 100 regression coefficients for single digits, that is, about 400 for combined digits, the simulation becomes inaccurate for high numbers of cases. In particular, we expect the required numbers to be underestimated. In other words, if more than 100 or 400 fraudulent digits are included in the estimated number of cases, the regression results should only be seen as a rough guess.8 To sum up, Table 10 shows benchmark values for given proportions of fraud, separated according to individual and aggregated values. As the combined digits were more efficient, we present the numbers of cases based on their linear equation. Regarding the necessary number of coefficients, the individual data are more efficient. At 100 percent fraud, individual values need about one fifth of the required amount of data at the aggregated level; this rises to about one third at the 10 and 5 percent levels.

8 This means that for the first three significant digits, our estimation will be affected even at 100 percent fraud. For the fourth digit, this problem occurs at 74.4 percent and for the combined digits at 70.5 percent fraud.


Table 10 Estimated number of coefficients to detect fraud for combined significant digits (data source: generated values based on the counterfeited student data)

                         Proportion of fraud
                  100%    75%     50%     25%      10%       5%
Aggregated data    378    575   1,135   4,161   25,343   100,992
Individual data     69    138     335   1,399    8,850    35,458
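Read together with equation (6) and Table 9, the Table 10 entries can be approximately reproduced; for the individual combined digits at 100 percent fraud, for example, equation (6) yields -49 + 223 = 174 digits, which translates into roughly 174 · 0.387 ≈ 67 coefficients - close to the 69 reported in Table 10, with the difference attributable to rounding in the reported estimates.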

Moreover, it is obvious that individual data provide much better evidence of regression fraud. The curves in Figure 5 show the required amount of individual unstandardized regression coefficients to identify fraud using Benford's law with a probability of 95 percent. As mentioned above, the most efficient method is the combination of significant digits: its curve runs constantly below those of the single significant digits. The hypothesis in favor of studying higher orders holds only for individual digits, not, however, for individual coefficients: due to the low amount of higher-order digits in real data, more coefficients are needed for the third and fourth significant digits than for the first and second. Still, if the forger counterfeits only few coefficients, the number of required observations rises. In other words, the probability of discovering fraud with Benford's law decreases and the method becomes ineffective. Given that the analysis of individual data is more capable, and in view of the results of a recent study by Tödter (2009), which examines individual fraud, we want to address a potential problem. In short, Diekmann, Tödter, as well as this study apply Benford's law to individual regressions in order to detect fraudulent data. Tödter found that 20 percent of the regression articles in Empirica and Applied Economics Letters did not follow the logarithmic distribution and consequently characterized them as doubtful.

[Figure 5: Required unstandardized regression coefficients to detect fraud with a probability of 95 percent, plotted against the proportion of fraudulent data for the 1st, 2nd, 3rd and 4th significant digits and the combined digits (individual data; data source: generated values based on the counterfeited student data)]


Another argument can be made. As mentioned in chapter 3, there is no theoretical explanation for the emergence of Benford's law in regressions. Furthermore, to the knowledge of the authors, this conclusion relies solely on empirical observations collected from several articles. All of these studies examine a mixture of regressions from different scientists. Hence, the reasoning drawn from these results can only refer to aggregated regressions, not to individual data. In addition, individual data might differ from Benford's law and could still show no deviation from the logarithmic distribution when aggregated: a combination of significant digits taken from random distributions will result in Benford's law (Hill 1996). Therefore, a combination of random regression articles could also result in Benford-distributed aggregated values. This argument could be an explanation for the very high percentage of fraud Tödter detected, and it should be taken into consideration before using Benford's law to investigate individual fraud.

6 Discussion

Based on considerations of using Benford's law to detect misconduct in science, we analyzed the capability of this instrument. First, we reproduced the findings of Diekmann (2007), who proposed that this technique could be applicable. Basically, we agree with this conclusion: regression coefficients follow a Benford distribution, and significant digits of higher order are more appropriate for detecting misconduct. But the simulations presented here call into question the fundamental capability of systematic detection. In this regard, our results suggest that, given a low percentage of fraudulent data, this method needs further improvement if fraudulent scientific data are to be detected.

To recapitulate, we have derived three basic insights from our simulations. First, if we have more than one significant digit, the combination of the digits available gives a more reliable indication of fraud. In addition, we expect that the combination of digits could also be applied in other areas of Benford-based fraud identification. Consequently, the use of the combined method is advisable. Second, the analysis of aggregated regression coefficients requires more data to discover fraud than the analysis of individual data. The stronger deviations from Benford's law at the individual level are compensated by the aggregation; hence, the aggregated distribution is closer to Benford's law. Taking this into consideration, we propose to use individual data, if there is future evidence for a Benford distribution in non-fraudulent individual regressions. Third, with only a small percentage of fraudulent data, the necessary number of coefficients becomes very high. In this last case, the method is nearly incapable of detecting fraud. Furthermore, if Benford's law is to be applicable, the α- and β-errors of significance tests have to be taken into consideration; it is to be expected that the required number of cases will then rise even further. Given the actual simulations, if low percentages of fraud are to be expected, the results do not suggest the use of the significant digit law to investigate regression fraud.

Further, the assumption that regression coefficients are Benford distributed rests solely on empirical evidence. Up to now, a theoretical explanation of why this assumption should hold is missing. Our results show that even if the assumption holds, the tool does not necessarily lead to reliable conclusions. Still, further research should take up the theoretical basis of the application of Benford's law to fraud detection, since our results raise serious questions but do not show that the idea of Benford-based fraud detection is inapplicable in general. First, it might be that the distributions of the fraudulent regressions created by our students do not resemble the distribution of real scientific forgers. Second,


there are several other research fields which produce far more data and whose fraudsters might create larger deviations from the significant-digit law. Third, if more random processes, such as the occurrence of a specific digit distribution like Benford's law, were integrated into the investigation process, the high number of required cases could be reduced.

References

Benford, F. (1938), The law of anomalous numbers. Proceedings of the American Philosophical Society 78: 551-572.
Broad, W., N. Wade (1984), Betrug und Täuschung in der Wissenschaft. Birkhäuser Verlag, Berlin.
Diekmann, A. (2007), Not the first digit! Using Benford's law to detect fraudulent scientific data. Journal of Applied Statistics 34: 321-329.
Fröhlich, G. (2003), Visuelles in der wissenschaftlichen Kommunikation - z.B. Betrug und Fälschung. European Journal for Semiotic Studies 15: 627-655.
Fröhlich, G. (2006), Plagiate und unethische Autorenschaften. Information Wissenschaft & Praxis 57: 81-89.
Giles, D.E. (2007), Benford's law and naturally occurring prices in certain ebay auctions. Applied Economics Letters 14: 157-161.
Günnel, S., K.H. Tödter (2009), Does Benford's Law hold in economic research and forecasting? Empirica 39: 273-292.
Hearnshaw, L.S. (1979), Cyril Burt, Psychologist. Cornell University Press, Ithaca, New York.
Hill, T.P. (1995), Base-invariance implies Benford's law. Proceedings of the American Mathematical Society 123: 887-895.
Hill, T.P. (1996), A statistical derivation of the significant-digit law. Statistical Science 10: 354-363.
Hungerbühler, N. (2007), Benfords Gesetz über führende Ziffern: Wie die Mathematik Steuersündern das Fürchten lehrt. Available at: http://www.educ.ethz.ch/unt/um/mathe/ana/benford/Benford_Fuehrende_Ziffern.pdf
Kuiper, N.H. (1962), Tests concerning random points on a circle. Proceedings of the Koninklijke Nederlandse Akademie van Wetenschappen A 63: 38-47.
Newcomb, S. (1881), Note on the frequency of use of the different digits in natural numbers. American Journal of Mathematics 4: 39-40.
Nigrini, M. (1992), The detection of income tax evasion through an analysis of digital distributions. Dissertation, University of Cincinnati.
Nigrini, M. (1996), A taxpayer compliance application of Benford's law. Journal of the American Taxation Association 18: 72-91.
Raimi, R.A. (1976), The first digit phenomenon. American Mathematical Monthly 83: 521-538.
Smith, K.H., M. Rogers (1994), Effectiveness of subliminal messages in television commercials: Two experiments. Journal of Applied Psychology 79: 866-874.
Tödter, K.H. (2009), Benford's Law as an Indicator of Fraud in Economics. German Economic Review 10: 339-351.
Wlodarski, J. (1971), Fibonacci and Lucas numbers tend to obey Benford's law. Fibonacci Quarterly 9: 87-88.

Dipl. Soz. Johannes Bauer, Ludwig-Maximilians-Universität München, Institut für Soziologie, Konradstr. 6, 80539 München, Germany. [email protected]
Dr. Jochen Groß, Senior Quantitative Consultant, Roland Berger Strategy Consultants Holding GmbH, Mies-van-der-Rohe-Str. 6, 80807 München, Germany.

Jahrbücher f. Nationalökonomie u. Statistik (Lucius & Lucius, Stuttgart 2011) Bd. (Vol.) 231/5+6

Plagiarism in Student Papers: Prevalence Estimates Using Special Techniques for Sensitive Questions

By Elisabeth Coutts†*, Zurich, Ben Jann, Bern, Ivar Krumpal and Anatol-Fiete Näher, Leipzig**

JEL A20; C81; C83
Plagiarism; sensitive questions; randomized response technique; item count technique; crosswise model.

Summary

This article evaluates three different questioning techniques for measuring the prevalence of plagiarism in student papers: the randomized response technique (RRT), the item count technique (ICT), and the crosswise model (CM). In three independent experimental surveys with Swiss and German university students as subjects (two web surveys and a survey using paper-and-pencil questionnaires in a classroom setting), each of the three techniques is compared to direct questioning and evaluated based on the "more-is-better" assumption. According to our results the RRT and the ICT failed to reduce social desirability bias in self-reports of plagiarism. In contrast, the CM was more successful in eliciting a significantly higher rate of reported sensitive behavior than direct questioning. One reason for the success of the CM, we believe, is that it overcomes the "self-protective no" bias known from the RRT (and which may also be a potential problem in the ICT). We find rates of up to 22 percent of students who declared that they ever intentionally adopted a passage from someone else's work without citing it. Severe plagiarism such as handing in someone else's paper as one's own, however, seems to be less frequent with rates of about 1 to 2 percent.

1 Introduction

Sensitive behavioral questions are an important source of systematic measurement error in surveys. As these questions relate to "taboo" topics, e.g. socially undesirable or illegal behavior, interviewees tend to underreport socially undesirable activities and overreport socially desirable ones (Lee 1993; Tourangeau et al. 2000; Tourangeau/Yan 2007). Therefore, the prevalence of the sensitive behaviors in question cannot be validly estimated. Researchers have developed various techniques to reduce the response bias resulting from socially desirable under- or overreporting. A promising approach is to increase the anonymity of the question-and-answer process and, hence, minimize the respondents' sense of intrusiveness.

* After long and severe illness, Elisabeth Coutts died of cancer on August 5, 2009.
** We thank Katrin Auspurg, Georg Böcherer, Norman Braun, Andreas Diekmann, Pascal Gienger, Jochen Groß, Thomas Hinz, Julia Jerke, Matthias Naef, Stefan Senn, Philipp Stadelmann, Philipp Stirnemann, and Diego Stutzer for their help with the different data collections reported in this article. This research was supported by the German Research Foundation (Priority Program 1292 on Survey Methodology).

Some of the techniques that follow this approach are the Randomized Response Technique (RRT; see Warner 1965; Greenberg et al. 1969; Fox/Tracy 1986), the Item Count Technique (ICT; see Raghavarao/Federer 1979; Droitcour et al. 1991; Dalton et al. 1994)¹ and the Crosswise Model (CM; see Yu et al. 2008). We present results of three independent experimental surveys in which these techniques are used to estimate the prevalence of plagiarism in student papers. Plagiarism is a highly sensitive topic, as illustrated by the recent news story about the German Federal Minister of Defense, Karl-Theodor zu Guttenberg, whose doctoral degree was withdrawn because of plagiarism. Plagiarism can be defined as the "appropriation of another person's ideas, processes, results, or words without giving appropriate credit" (Office of Science and Technology Policy 2000: 76262). With modern information technologies and the Internet providing easy access to texts, student plagiarism has increasingly become a problem over the last two decades. Since it threatens one of the main goals of higher education, that is, to qualify graduates for autonomous intellectual work, plagiarism represents a severe form of academic misconduct. To assess the magnitude of the problem, it is important to obtain valid estimates of the prevalence of plagiarism. However, since plagiarism is a sensitive topic, it is difficult to estimate true prevalence rates using surveys. Despite the usual assurance of confidentiality, students might fear sanctions and be unwilling to truthfully answer questions about plagiarism. Special survey techniques such as the RRT, the ICT and the CM can be a remedy, increasing anonymity and allowing for more valid estimates of student plagiarism as compared to direct questioning. In accordance with the "more-is-better" assumption (Lensvelt-Mulders et al. 2005; Krumpal 2010), we would expect the three techniques to yield higher prevalence estimates of plagiarism in student papers than direct questioning. In the next three sections we present results from three experimental studies, one for each technique, to assess whether the techniques are indeed successful in eliciting higher rates of plagiarism than direct questioning. Section five discusses the findings and concludes the article.

2 The randomized response technique (RRT)

The RRT was first introduced by Warner (1965) and later developed into various subforms (e.g., see Greenberg et al. 1969; Boruch 1971; Chaudhuri/Mukerjee 1988). All RRT schemes share the common feature of establishing a probabilistic link between the observed answer and the respondent's true state by means of a randomizing device. In Warner's original RRT scheme the respondent is confronted with two statements, the sensitive one ("I have cheated on a written exam at least once") and its negation ("I have never cheated on a written exam"). The respondent then uses a randomizing device with a known probability distribution (e.g. coins or dice) to determine which of the two statements he or she will answer. Since only the respondent knows the random outcome generated by the device, the meaning of a given answer, either "yes" or "no", remains ambiguous. That is, nothing definite can be inferred about the respondent's true state from a given answer. It is assumed that respondents appreciate this privacy protection and provide more honest answers than if they were asked directly.

¹ The ICT is also known as the Unmatched Count Technique or the List Experiment.


Despite the ambiguity of the respondents' answers in the RRT, the population prevalence of cheaters can be estimated on the basis of probability theory. In the Warner model the expected value λ of a "yes" response can be expressed as λ = pπ + (1 − p)(1 − π), where π is the unknown population proportion of cheaters and p (with p ≠ 0.5) is the probability of selecting the sensitive statement ("I have cheated on a written exam at least once"). Since the observed sample proportion of "yes" answers provides an estimate of λ, denoted by λ̂, and probability p is known by design, an estimate of the population prevalence of cheaters can be derived as π̂ = (λ̂ + p − 1)/(2p − 1) (for an overview of estimators for different RRT schemes see Fox/Tracy 1986). In a recent meta-analysis that, among others, included six studies linking RRT estimates to external validation data, Lensvelt-Mulders et al. (2005) showed that RRT surveys lead to more valid point estimates than conventional direct questioning in different survey modes (self-administered questionnaires, computer-assisted interviews, telephone interviews and face-to-face interviews). However, the analysis also revealed that the RRT still underestimates true population rates. Moreover, the results indicated that the performance of the RRT improves with increasing sensitivity of the items. This is consistent with results from previous studies (e.g. Himmelfarb/Lickteig 1982). The gains of the RRT, however, come at the cost of reduced statistical power since random noise is added to the data. Depending on p, the loss in efficiency must be accounted for by an increased sample size (Warner 1965; Lensvelt-Mulders et al. 2005). Another problem is the complexity of the RRT. To use the technique properly, the respondents have to read and comprehend the instructions, which imposes an additional cognitive burden on the interviewees when answering RRT questions (Tourangeau et al. 2000). Thus, the RRT typically results in lower response rates (Buchmann/Tracy 1982; Houston/Tran 2001). Furthermore, a more complex answering process is prone to other sources of error such as non-compliance with the instructions or misunderstanding of the technique (Musch et al. 2001; Lensvelt-Mulders/Boeije 2007; Böckenholt/van der Heijden 2007). One reasonable explanation for why some respondents may not follow the instructions is that they mistrust the privacy protection induced by the procedure (Abernathy et al. 1970; Krótki/Fox 1975; Landsheer et al. 1999; Soeken/Macready 1982).

2.1 Method and data

In an experimental web survey conducted at the ETH Zurich (Switzerland) in 2005, the study subjects were asked the following sensitive questions (translated from German):
• "In one of these papers (term paper, bachelor, master or diploma thesis), have you ever deliberately concealed a quotation?"
• "In another paper during your study time (not term paper, bachelor, master or diploma thesis), have you ever deliberately failed to cite a quotation?"
To experimentally evaluate the effect of the RRT on the prevalence estimates, the respondents were randomly assigned one of three versions of the questionnaire: a direct questioning version as the control group and two variants of the RRT. Both RRT implementations had a "forced-choice" design, which is relatively efficient compared to other RRT designs (Boruch 1971). For randomization the respondents were asked to flip a coin two times. The instruction was to answer the sensitive question truthfully in case of heads in the first toss. In case of tails in the first toss, however, the respondents were instructed to automatically answer "yes" or "no" depending on whether the second toss was heads or tails, respectively. Hence, the probability of an automatic "yes" was p_1 = 0.25 and the probability of an automatic "no" was p_2 = 0.25. Accordingly, the probability to answer the sensitive question was p_3 = 1 − p_1 − p_2 = 1 − 0.25 − 0.25 = 0.5. In this design, an unbiased estimate of the population proportion to which the sensitive question applies is

π̂_RRT = (λ̂ − p_1) / p_3 = (λ̂ − 0.25) / 0.5    (1)

where λ̂ is the observed sample proportion of "yes" answers. An estimate of the sampling variance of π̂_RRT is given by

Var(π̂_RRT) = λ̂(1 − λ̂) / (n · p_3²) = λ̂(1 − λ̂) / (0.25 · n)    (2)

with n as the sample size. The two RRT variants used in our study differed with respect to the timing of the coin tosses. In the first RRT version, respondents were asked to flip the coin for each of the two sensitive questions separately directly before answering a question. In the second version, the coin flips had to be performed for both questions before turning to the page containing the sensitive questions. In June 2005, 7,201 students of the ETH Zurich were invited via e-mail to participate in the web survey. To improve the response rate, a reminder was sent nine days after the initial invitation. As an additional incentive, participants were informed that they could register for a lottery of book vouchers (1 × 100 CHF and 2 × 50 CHF) after completion of the questionnaire. The overall response rate was 32 % with 829 respondents in the direct questioning condition, 722 respondents in the first RRT condition, and 756 respondents in the second RRT condition.
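As a minimal sketch of equations (1) and (2), assuming an array of observed yes/no answers (the simulated data below are hypothetical and only illustrate the computation):

import numpy as np

def rrt_estimate(yes_answers, p1=0.25, p3=0.5):
    # forced-choice RRT: prevalence estimate (1) and standard error from (2);
    # p1 = probability of a forced "yes", p3 = probability that the sensitive
    # question itself is answered (here: two coin flips)
    yes_answers = np.asarray(yes_answers)
    lam = yes_answers.mean()            # observed proportion of "yes"
    n = len(yes_answers)
    pi_hat = (lam - p1) / p3            # equation (1)
    var = lam * (1 - lam) / (n * p3**2) # equation (2)
    return pi_hat, np.sqrt(var)

# hypothetical data: 1,478 respondents, 28 % observed "yes" answers
rng = np.random.default_rng(1)
answers = rng.binomial(1, 0.28, size=1478)
print(rrt_estimate(answers))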

2.2 Results

As illustrated in Table 1, the RRT did not yield the expected results. In the table, the estimates from both RRT conditions are combined; the findings do not change if the conditions are analyzed separately (not shown). For direct questioning the estimate of the proportion of students who ever committed plagiarism is 12.0 % for major papers (term paper, bachelor, master or diploma thesis) and 19.4 % for other papers. Using the RRT, however, the estimates are only 3.7 % and 17.6 %. Although the negative differences between direct questioning and the RRT are not statistically significant (using a conventional 5 % level), they indicate that at least for the question on major papers a

Table 1 Prevalence Estimates of Plagiarism in the RRT Experiment (in percent; standard errors in parentheses)

                                             Direct Questioning    RRT                     Difference
Plagiarism in major papers (term paper,
  bachelor, master or diploma thesis)        12.0 (2.0), N = 266   3.7 (4.0), N = 495      -8.3 (4.4)
Plagiarism in other papers                   19.4 (1.4), N = 826   17.6 (2.4), N = 1,521   -1.8 (2.8)


substantial proportion of students may not have complied with the RRT instructions. If we assume that students confronted with the sensitive question in the RRT are at least as likely to admit plagiarism as if asked directly, a systematic negative difference can only emerge if some of the students answer "no" although they should have given an automatic "yes" according to the outcome of the randomizing device.

3 The item count technique (ICT)

The ICT is another approach that provides anonymity by obscuring the link between given answers and individual behavior. The advantage of the ICT over the RRT is that it is easier to understand and implement because there is no need for a randomizing device (see, e.g., Coutts/Jann 2011). With the ICT, respondents are asked about the sensitive behavior directly. However, to guarantee anonymity, the respondents are instructed to give a joint answer to a whole list of items. The respondents only indicate how many of the items apply, possibly including the sensitive item, but not which ones. To be able to estimate the prevalence of the sensitive item, two subsamples are generated via randomization. One subsample of respondents answers a short list (SL) of k items excluding the sensitive question. The other subsample answers a long list (LL) including the k items of the short list plus the sensitive item. For example, to estimate the prevalence of exam cheating the following list of items could be used:
1. "Have you visited a zoo last week"
2. "Have you ever been involved in a car accident"
3. "Did you have dinner at a three-star restaurant during the last month"
4. "Did you cheat on a written exam in the past semester" (LL only)
The respondents in both subsamples are then asked to report the number of behaviors that apply to them. Since a joint answer for all items is given, there is no certainty about whether a specific respondent engaged in the sensitive behavior or not unless all or none of the items in the list apply. An unbiased estimate of the proportion of cheaters is given in this design by the mean difference of reported counts in the two subsamples:

π̂_IC = x̄_LL − x̄_SL    (3)

where x̄_LL and x̄_SL are the sample means in the group with the long list and the group with the short list, respectively. Furthermore, the sampling variance is

Var(π̂_IC) = Var(x̄_LL) + Var(x̄_SL)    (4)

since the two groups are independent by design. To keep the variance low, the lists should not be too long and should include non-sensitive items that have a low variance, that is, either a low prevalence or a high prevalence (Droitcour et al. 1991). Furthermore, it is beneficial if the items are negatively correlated (Glynn 2009; Stirnemann 2009). Tourangeau and Yan (2007) report in a meta-analysis that the ICT tends to produce higher estimates of socially undesirable behaviors than direct questioning (although the overall effect was not significant in the meta-analysis and the variance between studies was high).
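A minimal sketch of estimator (3) and its standard error from (4), assuming the reported counts of the two independent subsamples are available as arrays (the simulated data are hypothetical):

import numpy as np

def ict_estimate(counts_long, counts_short):
    # item count technique: difference in means (3) and standard error (4)
    counts_long = np.asarray(counts_long)
    counts_short = np.asarray(counts_short)
    pi_hat = counts_long.mean() - counts_short.mean()
    var = (counts_long.var(ddof=1) / len(counts_long)
           + counts_short.var(ddof=1) / len(counts_short))
    return pi_hat, np.sqrt(var)

# hypothetical example: 3 non-sensitive items plus one sensitive item (8 %)
rng = np.random.default_rng(7)
short = rng.binomial(3, 0.3, size=500)
long_ = rng.binomial(3, 0.3, size=500) + rng.binomial(1, 0.08, size=500)
print(ict_estimate(long_, short))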


3.1 Method and data

In an experimental online survey conducted at the University of Konstanz (Germany) in summer 2009, the respondents were given lists containing the two following sensitive items (translated from German):²
• "When writing an assignment (e.g. seminar paper, term paper, thesis), I have intentionally adopted a passage from someone else's work without citing the original." (partial plagiarism)
• "I have had someone else write a large part of an assignment for me or have handed in someone else's work (e.g. from www.hausarbeiten.de) as my own." (severe plagiarism)
The ICT was implemented using a "double-list" design to reduce variance (Biemer/Brown 2005; Tsuchiya et al. 2007). In the double-list design all respondents answer two lists, a long list and a short list, with differing sets of items. Again the respondents were randomly divided into two subsamples so that the sensitive item could be combined with one set of items in one subsample and with the other set of items in the other subsample. Let SL1 and LL1 be the short and long list versions for the first set of items, respectively. Likewise, LL2 and SL2 are the lists for the second set of items. Subgroup A receives LL1 and SL2, subgroup B receives SL1 and LL2. The design then provides two separate estimates of the sensitive behavior,

π̂_1 = x̄_LL1 − x̄_SL1  and  π̂_2 = x̄_LL2 − x̄_SL2

that can be pooled together to reduce the sampling variance. The pooled estimate is

π̂_DL = (π̂_1 + π̂_2)/2 = [(x̄_LL1 − x̄_SL1) + (x̄_LL2 − x̄_SL2)]/2
     = (1/2)·[(1/n_A)·Σ_{i∈A}(X_1,i − X_2,i) + (1/n_B)·Σ_{i∈B}(X_2,i − X_1,i)]    (5)

where the first sum in the last expression is over the n_A observations of subgroup A and the second sum is over the n_B observations of subgroup B. X_1,i and X_2,i are respondent i's answers for the first and second list, respectively. The sampling variance of π̂_DL is

Var(π̂_DL) = [Var(π̂_1) + Var(π̂_2) + 2Cov(π̂_1, π̂_2)]/4
           = [Var(x̄_LL1 − x̄_SL2) + Var(x̄_LL2 − x̄_SL1)]/4    (6)

where Var(x̄_LL1 − x̄_SL2) and Var(x̄_LL2 − x̄_SL1) can easily be estimated based on observation-specific differences between X_1,i and X_2,i, as can be seen in (5).
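A minimal sketch of the pooled double-list estimator (5) and its standard error from (6); the argument names (subgroup A answers long list 1 and short list 2, subgroup B answers short list 1 and long list 2) are our own labels:

import numpy as np

def ict_double_list(x1_a, x2_a, x1_b, x2_b):
    # double-list ICT, equations (5) and (6)
    d_a = np.asarray(x1_a) - np.asarray(x2_a)  # within-person LL1 - SL2 (subgroup A)
    d_b = np.asarray(x2_b) - np.asarray(x1_b)  # within-person LL2 - SL1 (subgroup B)
    pi_hat = (d_a.mean() + d_b.mean()) / 2
    var = (d_a.var(ddof=1) / len(d_a) + d_b.var(ddof=1) / len(d_b)) / 4
    return pi_hat, np.sqrt(var)

The within-person differencing is what makes the double-list design efficient: each respondent contributes to one of the two independent difference terms, so the covariance between the two list answers is used rather than ignored.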

² See the appendix for the complete lists of items. The items were selected from a larger set of items evaluated in a pretest with a random sample of students of the University of Konstanz. These students were not included in the main study. The purpose of the pretest was to obtain an estimate of the correlations among the items. To evaluate the potential efficiency gains that can be achieved by selecting items based on such information, we arranged the items such that in one set the items were negatively correlated. As expected, in the main study the variance of the ICT estimate based on the negatively correlated set was lower than the variance based on the standard set, but the differences were only small (results not shown).


Apart from the two ICT groups, an experimental control group with direct questioning was included in this study to identify the effect of the ICT on the prevalence estimate. In July 2009, 7,205 students of the University of Konstanz were invited by e-mail to participate in the web survey. Reminder e-mails were sent out ten days after the initial invitation to students who did not respond. As an incentive for participation, respondents could win a gift certificate for books (300 EUR). The overall response rate was 24 % with 552 respondents in the direct questioning group, and 591 and 565 respondents in the two ICT groups.

3.2 Results

The ICT yielded a higher estimate of the prevalence of partial plagiarism than direct questioning, as can be seen in Table 2 (9.0 % versus 8.1 %). However, the difference is very small and insignificant. Furthermore, for severe plagiarism, the results are reversed with an estimate of 2.0 % based on direct questioning and an estimate of -4.0 % based on the ICT. Of course a negative prevalence does not make sense, but the ICT estimator can be negative by construction. In the case of a low true prevalence, negative estimates may occur by chance. Another reason for the negative estimate can be that respondents give biased answers. Assuming that the true prevalence is not below 2 % - the estimate from direct questioning - an ICT estimate of -4 % or below is unlikely (the probability of such a result would be about 9 % given a standard error of 4.5). We therefore suspect that in our study at least part of the respondents systematically underreported the number of behaviors when answering the long list containing the sensitive item. Respondents probably do not feel comfortable if they have the impression that their true answer might make it "look like" they engaged in the sensitive behavior, which may cause them to avoid reporting a high number of behaviors. Similar to the RRT, this bias due to self-protective answering appears to be larger for more severe forms of plagiarism.

Table 2 Prevalence Estimates of Plagiarism in the ICT Experiment (in percent; standard errors in parentheses)

                      Direct Questioning    ICT           Difference
Partial Plagiarism    8.1 (1.4)             9.0 (4.0)     0.9 (4.2)
Severe Plagiarism     2.0 (0.7)             -4.0 (4.4)    -6.0 (4.5)
Observations          396                   846

4 The crosswise model (CM)³

³ For a more complete account of this study see Jann et al. (forthcoming).

The CM has been developed by Yu et al. (2008) to overcome some of the RRT's shortcomings. The technique is based on a simple idea: a sensitive item X and an unrelated non-sensitive item Z with a known prevalence π_Z (π_Z ≠ 0.5) are presented to the respondent. The respondent is then asked to give a joint answer to both items in combination,

but not to answer the items individually. For example, consider exam cheating as the sensitive item and being born in January, February, or March as the non-sensitive item (which can be assumed to be independent of exam cheating and for which the population distribution is known). Respondents are then instructed to choose response option A if both items apply or none applies and to choose response option B if only one of the items applies. Contrary to the RRT there is no obvious evasive answering strategy, because both response options represent "guilty" and "innocent" respondents. Assuming that the two items are uncorrelated, the probability of answer A is λ = π_X·π_Z + (1 − π_X)(1 − π_Z), where π_X denotes the unknown prevalence of the sensitive behavior. An unbiased estimator for π_X hence follows as

π̂_CM = (λ̂ + π_Z − 1) / (2π_Z − 1),  π_Z ≠ 0.5
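A minimal sketch of the crosswise estimator, assuming observed 0/1 indicators for response option A and the known prevalence π_Z of the non-sensitive item; the variance line is the standard delta-method approximation and is our own addition, not a formula quoted from the text:

import numpy as np

def crosswise_estimate(a_answers, pi_z):
    # crosswise model: a_answers = 1 if option A ("both or neither apply"),
    # pi_z = known prevalence of the non-sensitive item (pi_z != 0.5)
    a_answers = np.asarray(a_answers)
    lam = a_answers.mean()
    n = len(a_answers)
    pi_hat = (lam + pi_z - 1) / (2 * pi_z - 1)
    var = lam * (1 - lam) / (n * (2 * pi_z - 1) ** 2)  # delta-method variance (assumption)
    return pi_hat, np.sqrt(var)

# hypothetical example: pi_z = 0.25 (born January-March), true pi_x = 0.10
rng = np.random.default_rng(3)
lam_true = 0.10 * 0.25 + 0.90 * 0.75
answers = rng.binomial(1, lam_true, size=1000)
print(crosswise_estimate(answers, pi_z=0.25))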


Pitfalls of International Comparative Research: Taking Acquiescence into Account

By Axel Franzen and Dominikus Vogl

country that is contained in all three surveys: the ISSP reports 24 % agreement in Spain, the EVS 48 % and the WVS even 52 %. Comparing the seven countries that are listed in both the WVS and the ISSP (Canada, Chile, Japan, Mexico, Philippines, Spain, USA), the average difference in the agreement rate is 31 percentage points. This comparison demonstrates that small changes in the wording of questions or in the number of answer categories can have considerable consequences. Based on these findings we standardize the answer scales by using their complete information, instead of only the proportion agreeing. We did this by treating them as interval scales and recoding them in such a way that higher values indicate stronger agreement. Furthermore, we divided the sum of the values obtained from both items by the sum of the answer categories available. Thus, the new scale is standardized between 0 and 1, where 1 indicates the highest willingness to pay and zero the lowest. On average the countries in the ISSP reach a value of 0.44, the countries in the WVS a value of 0.58, and the ones in the EVS a value of 0.51. Next, we calculated the bivariate correlation between the measured willingness to pay and the purchasing power adjusted gross national product (PPP) of the countries for every survey separately (see Figure 1, where the index displayed on the y-axis is multiplied by 100). Using data of the ISSP we find a positive correlation of 0.54 which is statistically significant (p=0.005). An analysis of the EVS data results in a correlation of -0.04 (p=0.85), and the analysis of the WVS generates a non-significant negative correlation of -0.28 (p=0.177). These results replicate former findings using the ISSP (Franzen 2003, Franzen/Meyer 2010) and the WVS (Dunlap/York 2008). Dunlap and York report a negative correlation of -0.32 using data from the WVS and EVS, the same measurement of environmental concern, and taking the natural logarithm of the countries' GDP per capita. In our case taking the natural logarithm of the adjusted GDP per capita does not change any of the reported results. Also, using the Spearman rank correlation instead of

[Figure 1 The correlation between wealth and environmental concern for the ISSP, EVS, and WVS; y-axis: willingness-to-pay index (multiplied by 100), x-axis: GDP per capita]


the Pearson correlation does not lead to any substantial differences in the reported correlations or significance levels. Thus, an analysis of the three international surveys results in the paradoxical finding that the ISSP data produce a positive correlation, analysis of the EVS shows no association, and an analysis of the WVS produces a negative correlation. Hence, curiously, the three surveys generate all possible options. A possible solution of the puzzle is the combination of the three data sets. However, the higher levels of agreement in the WVS and EVS as compared to the ISSP, which are due to differing methodology, are an obvious problem when pooling the data. In addition, cross-cultural research has often pointed out that some countries (presumably non-Western countries) have generally higher levels of acquiescence. Within national studies (e.g. Ross/Mirowsky 1984) it is often found that individuals with low socio-economic status (SES) acquiesce more than respondents with higher education or income. In cross-cultural studies it is additionally argued that cultures can vary with respect to individualism versus collectivism. Collectivistic cultures are supposed to be more group-oriented and to encourage acquiescence more than individualistic cultures (cf. Bosau 2009; Hofstede 2001; Marin et al. 1992; Smith/Fischer 2008; van de Vijver/Leung 1997). Since the ISSP contains more OECD or developed countries than the WVS or the EVS, which include many Eastern European nations, cultural differences might also explain some of the variation in acquiescence.

4 Taking acquiescence into consideration

In order to pool and compare the data we first calculated a measure of acquiescence for every respondent in every country. We follow the standard procedure as suggested in the literature on cross-cultural research (e.g. Hofstede 1980; Matsumoto/Yoo 2006; Smith 2004) and selected as many conceptually unrelated items as possible from the surveys' questionnaires which respondents had been asked to agree or disagree with on 4- or 5-point Likert scales. The ISSP 2000 contains 28 such questions in addition to the two items measuring the willingness to pay, which are not included for measuring acquiescence. All agreeing answers (agree or strongly agree) were summed up for each respondent and divided by the total number of items considered. This way the measure of acquiescence ranges from 0 to 1 for every respondent and can be interpreted as the proportion of items agreed to when multiplied by 100. A value of zero denotes a respondent who never agreed to a statement, irrespective of the content or formulation of the item or its coding. A value of 1 denotes the other extreme, i.e. a respondent who agreed to every statement. The average of all respondents in a given country denotes the degree of acquiescence in that country. The WVS 1999-2001 contains 24 items, of which 14 use a four-point answering scale and the rest a five-point answering scale. Finally, the EVS 1999 has 34 items (of which 15 have four answer categories) suitable for the calculation of the tendency of agreement. Table 3 depicts the results of the measure of acquiescence for the 60 countries contained in the three survey programs. In the ISSP, New Zealand and Japan have the lowest acquiescence of 0.37, and Portugal is the country with the highest value of 0.65. The average agreement in the ISSP is 0.46. In the WVS, measurement of acquiescence results in an average of 0.60. The lowest value is found for the USA with a value of 0.47, and the highest value in the Philippines with a value of 0.71. Among the European nations in the EVS the Netherlands show the lowest acquiescence with a value of 0.45, and respondents from Romania the highest (0.71), with an average of 0.59. Thus, these differences reflect rather closely the differences observed with respect to the willingness to pay among the three surveys.
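As a minimal sketch of this measure, assume the attitude items of one survey are collected in a data frame with one column per item; the coding (1 = strongly agree, 2 = agree) and the column layout are assumptions for illustration:

import pandas as pd

def acquiescence_score(items: pd.DataFrame, agree_codes=(1, 2)) -> pd.Series:
    # share of answered attitude items rated "agree" or "strongly agree";
    # respondents answering fewer than 50 % of the items are dropped,
    # following the note to Table 3
    answered = items.notna().sum(axis=1)
    agreed = items.isin(agree_codes).sum(axis=1)
    score = agreed / answered
    return score.where(answered >= items.shape[1] / 2)

The country-level acquiescence value is then simply the mean of this score over all respondents of a country.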


Table 3 The measurement of acquiescence

[Country-level acquiescence scores in six columns: ISSP 2000, ISSP 1999-2006, WVS 1999-2001, WVS 2005-2008, EVS 1999, EVS 2008. Rows (60 countries): Albania, Argentina, Austria, Bangladesh, Belarus, Belgium, Bosnia and H., Bulgaria, Canada, Chile, China, Croatia, Czech Republic, Denmark, Estonia, Finland, France, Germany, Great Britain, Greece, Hungary, Iceland, India, Ireland, Israel, Italy, Japan, Kyrgyz Republic, Latvia, Lithuania, Luxembourg, Macedonia, Malta, Mexico, Moldova, Netherlands, New Zealand, Norway, Peru, Philippines, Poland, Portugal, Romania, Russia, Serbia, Singapore, Slovak Republic, Slovenia, South Africa, South Korea, Spain, Sweden, Switzerland, Tanzania, Turkey, Uganda, Ukraine, USA, Vietnam, Zimbabwe. The last row reports the correlations between the two waves of each survey program: r = 0.85*** (ISSP), r = 0.71** (WVS), r = 0.60** (EVS).]

Note: Acquiescence was calculated for each individual in every country by dividing the number of respondents' positive answers (strongly agree or agree) by the number of all relevant attitude statements (excluding the willingness to pay items) in the surveys. We took only those respondents into consideration who answered at least 50 % of the rating questions used to construct the measure. All attitude items in the ISSP have 5 answer categories; the WVS and EVS have both 4 and 5 answer categories. Alternatively, we also measured acquiescence in the WVS and EVS by only taking the items with 5 answer categories into account. However, this variation does not affect the measure of acquiescence or any of the results reported in Table 4.

Next, we wondered whether our measure of acquiescence per country depends on a specific survey or whether the results are reliable if the measure is calculated from different waves of the surveys. The ISSP is conducted almost every year in most countries. We picked the surveys from 1999 to 2006 (excluding 2000), calculated the average acquiescence from those years and compared it to our measure obtained with the data from the ISSP 2000.² As it turns out, the measure of acquiescence in 2000 is highly correlated (r = 0.85) with the average of the years 1999 to 2006, suggesting high reliability irrespective of the special topic of the surveys. The WVS was repeated in 2005-2008 and contains 21 suitable items. The correlation between acquiescence measured in 1999-2001 and 2005-2008 is 0.71, suggesting high reliability as well. The EVS was repeated in 2008 with overall 30 attitude items. Here the correlation of acquiescence between the two surveys is 0.60, which is still a fair value. Thus, overall our measure of acquiescence is rather reliable at the country level and does not depend much on the survey years or the specific topic of the surveys.

² The number of suitable attitude items contained in the ISSP from 1999 to 2006 varies between 8 and 30.

In order to take the level of acquiescence into consideration we weighted the standardized index of a country's willingness to pay with the reverse value of the coefficient of acquiescence. For example, the USA has an average value of environmental concern of 0.33 in the ISSP. However, only a certain proportion of this value is due to the "true" willingness to pay, while another part results from the general tendency to agree. Therefore we took the reverse of the coefficient of acquiescence (1 − 0.38 = 0.62) and weighted (multiplied) the original scale by it. Thus, for the USA this weighting results in a value of 0.21.³
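Both weighting variants, the multiplication by (1 − acquiescence) used here and the division-based alternative described in footnote 3 below, reduce to one line each; a minimal sketch with hypothetical inputs:

import pandas as pd

def weighted_wtp(wtp_index: pd.Series, acquiescence: pd.Series) -> pd.Series:
    # weight the standardized willingness-to-pay index (0-1) by the
    # complement of acquiescence, e.g. USA: 0.33 * (1 - 0.38) = 0.21 (rounded)
    return wtp_index * (1 - acquiescence)

def weighted_wtp_alt(wtp_index: pd.Series, acquiescence: pd.Series) -> pd.Series:
    # alternative from footnote 3: divide by acquiescence,
    # e.g. USA: 0.33 / 0.38 = 0.87 (rounded)
    return wtp_index / acquiescence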

The correlation of the weighted willingness to pay with countries' wealth (purchasing power adjusted gross national product per capita) for each survey is displayed in Figure 2.

[Figure 2 Correlations between wealth and weighted environmental concern for the ISSP, EVS, and WVS; panel correlations r = 0.72, 0.60, and 0.49; x-axis: GDP per capita]

Hence, taking acquiescence into consideration increases the strength of the positive association for the ISSP and turns the zero association in the EVS and the negative correlation for the WVS into clearly positive and statistically significant correlations. Also, combining all 60 countries from the ISSP, WVS, and the EVS results in a statistically significant correlation of r = 0.49.⁴ The correlation using all countries is depicted in Figure 3.

³ Alternatively, the weighting can also be accomplished by multiplying the willingness to pay by the inverse of the acquiescence value. For the USA this procedure would result in 0.33/0.38 = 0.87. This weighting expresses the willingness to pay in relation to the general acquiescence. Values above 1 mean that the willingness is higher than the "standard" level of acquiescence. Both weighting methods lead to the same result when correlating the weighted willingness to pay with the adjusted GDP.
⁴ We took the average of a country's willingness to pay and acquiescence in case it was contained in more than one survey.

The data sets of the 60 countries do not only contain information about environmental concern but also some socio-demographic characteristics of the respondents. In addition, more statistical information about the countries' characteristics is available from the United Nations Development Program or the European Commission. Hence, the data available can be analyzed by multilevel analysis (Snijders/Bosker 1999; Rabe-Hesketh/Skrondal 2008). At the individual level (level 1) theoretical considerations led us to expect that respondents' income, education, and age should affect environmental concern. The wealth effect should not only affect environmental concern at the macro level of countries, as demonstrated in Figure 3; it should also explain the inter-individual differences within countries.

[Figure 3 The correlation between environmental concern and wealth in the 60 countries from the ISSP, WVS, and EVS⁵; x-axis: GDP per capita (PPP) in 2000]

In the analyses that follow we calculate individuals' household equivalence income by dividing the income of the household by the square root of the number of individuals living together in one household. This procedure has the advantage that we measure the standard of living more accurately than by taking only personal incomes, which some earners share with family members. Taking the household income also allows measuring the standard of living of respondents who are not active on the labor market or have no personal income. Instead of trying to adjust this equivalence income by purchasing power, we conducted a z-standardization of the income variable and measure for every respondent the standardized difference from the country's average income. Thereby we measure the income position of a respondent relative to the average income of the country in which he or she lives.
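A minimal sketch of this income measure, assuming a data frame with the hypothetical columns hh_income, hh_size and country:

import numpy as np
import pandas as pd

def relative_equivalence_income(df: pd.DataFrame) -> pd.DataFrame:
    # household equivalence income, z-standardized within each country
    df = df.copy()
    df["equiv_income"] = df["hh_income"] / np.sqrt(df["hh_size"])
    df["rel_income"] = (df.groupby("country")["equiv_income"]
                          .transform(lambda s: (s - s.mean()) / s.std(ddof=1)))
    return df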

⁵ Since Luxembourg is an outlier with respect to GNP per capita it is excluded from Figure 3. This exclusion affects the results only marginally. The correlation including Luxembourg is 0.46.


Moreover, environmental concern should also depend on the level of education. Respondents with a higher education are better informed about the state of the environment, which should also increase their concern about and understanding of environmental protection. People with little information and knowledge about the state of the planet cannot be concerned about it. Furthermore, older people should have lower concern for the environment than younger people. This could be due to a cohort effect rather than to the effect of aging itself. Younger cohorts were more exposed to environmental concern through public discussions, political debates and media reports than older cohorts. Accordingly, younger generations should be more sensitive to environmental issues. The literature also discusses the effect of gender on environmental concern (see Blocker/Eckberg 1997; Wilson et al. 1996). Some studies find higher concern among women than men. Also at the macro level certain characteristics of countries could influence environmental concern. Besides the level of wealth itself, the distribution thereof has to be taken into consideration. A high level of inequality could direct public attention and politics more towards economic issues and redistribution. These goals could be in competition with environmental issues. To test this hypothesis we calculated the Gini coefficient from the income data of the surveys. We expect that the Gini coefficient is negatively correlated with the willingness to pay for a better environment. Environmental concern could also be affected by environmental quality. We therefore included the environmental sustainability index 2001, which was compiled by the Yale Center of Environmental Law and Policy (YCELP), the Center for International Earth Science Information Network of Columbia University (CIESIN), the World Economic Forum (WEF) and the European Commission. The index consists of different subscales, from which we took a country's index of air and water quality as well as the index of biodiversity and soil erosion. If a country has a low local environmental quality, it should sensitize respondents and increase the willingness to pay to improve environmental quality. In most countries environmental quality is not evenly distributed but is worse in urban areas than in the more sparsely populated countryside. Generally, heavily populated regions are supposed to have a poorer quality of air and water. Therefore, we also included the population density of a country as well as the proportion of the population living in cities. The denser the population and the higher the proportion of respondents living in an urban area, the worse the objectively and subjectively perceived environmental quality should be. Thus, higher population density as well as a higher proportion of inhabitants living in cities should increase the willingness to pay for environmental protection.⁶

We apply a varying-intercept multilevel model to the data and estimate the coefficients via the maximum likelihood method. Level one of our analysis takes the variables of the individuals (x) into account and level two the country-specific characteristics (z). On the one side, the willingness to pay Y_ij depends on the characteristics of the i = 1, ..., n individuals. On the other side, we consider the country-specific characteristics by varying the intercept β_0j according to the macro-level variables z of the j = 1, ..., k countries:

Y_ij = β_0j + β_1·x_1,ij + ... + β_7·x_7,ij + ε_ij
β_0j = γ_00 + γ_01·z_1j + ... + γ_07·z_7j + ζ_j

The estimation results of the specified models are presented in Table 4. Overall, we report five different models of regressing the willingness to pay on the individual and country-specific variables. The first model takes only the 49 countries of the WVS and EVS into account. It is therefore basically a replication of Dunlap and York (2008) using multilevel analysis instead of simple correlations. Model 2 controls for acquiescence by introducing the individual-specific variable as an independent variable into the regression equation.

⁶ A detailed description of every variable is contained in Table A of the appendix.
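A minimal sketch of such a varying-intercept model; statsmodels' MixedLM is one possible implementation (not necessarily the software used by the authors), and the column names are hypothetical:

import statsmodels.formula.api as smf

def fit_varying_intercept(df):
    # random-intercept model: individual-level predictors plus a country-level
    # covariate; the country-specific intercepts correspond to beta_0j above
    model = smf.mixedlm(
        "wtp ~ rel_income + education + age + female + gdp_ppp",
        data=df,
        groups=df["country"],
    )
    return model.fit(reml=False)  # maximum likelihood, as in the text

# result = fit_varying_intercept(pooled_df)
# print(result.summary())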

In models 3 and 4 we repeat both analyses by combining all three data sets. Finally, in Model 5 we alternatively control for acquiescence by weighting the dependent variable. A first test of whether multilevel analysis is useful is to estimate the intraclass correlation coefficient (ICC) of the null model. The ICC coefficients of the various models show values between 0.08 and 0.12, indicating that only a small proportion (8 % to 12 %) of the overall variance is due to macro-level variation. However, multilevel modeling is still applicable. In the first four models the dependent variable is the standardized scale of respondents' willingness to pay multiplied by 100 (values between 0 and 100). The first model takes only the 49 countries contained in the WVS and EVS into consideration. Some smaller countries (Bosnia, Kyrgyz Republic, Luxembourg, Macedonia, Malta, Serbia, and Tanzania) have missing data on important variables and drop out of the analysis. As can be seen, all individual variables have the expected effect, and these effects change rather little between models 1 through 5. Thus, individuals' education and relative income position are positively related to the willingness to pay. Age has a negative effect and respondents' sex is positively related to environmental concern. At the macro level our main interest is the effect of purchasing power adjusted GDP (PPP) on individuals' willingness to pay. The result in model 1 shows that GDP is negatively (but statistically insignificantly) associated with the willingness to pay. Thus, the multilevel analysis basically replicates the bivariate correlation presented in Figure 1. Additionally, the model also takes the proportion living in urban areas, population density, environmental quality and income inequality (as measured by the Gini coefficient) into account. None of these macro variables is significantly related to the willingness to pay (as is also the case in the further models). Next, Model 2 controls for acquiescence by introducing the individually measured variable as a control into the multilevel regression model. First, acquiescence is statistically significantly related to the willingness to pay (β = 0.17, p = 0.000). Second, taking acquiescence into account changes the direction of the effect of GDP, which is now positively related to the willingness to pay, although statistically not significant. In Model 3 the analysis is extended by incorporating all three data sets. This increases the number of individual observations from 49,780 to 70,905. However, the combination of the three data sets increases the number of countries only by 4 (from 49 to 53) since many countries (20) are contained in more than one data set. In model 3 the effect of GDP increases in size (β = 0.19) but is still statistically insignificant. However, taking acquiescence into account again in Model 4 not only increases the positive effect of GDP but also makes the effect statistically significant. The effect of GDP (β = 0.26) has a p-value of 0.055 and is just short of the conventional 5 % significance level. Finally, in Model 5 we use an alternative approach to control for acquiescence. We subtracted each individual's acquiescence value from 1 and multiplied it by the individual's willingness to pay. This weighting procedure is analogous to the one applied to country averages in Figure 3.
In Model 5 the positive effect of GDP on the willingness to pay is stronger (β = 0.37) and statistically highly significant (p = 0.000). The model explains 46 % of the variance observed at the macro level and 7 % of the variance observed at the micro level; it therefore indicates the best fit as compared to models 1 through 4.⁷

⁷ The results are not affected if we use an alternative weighting procedure and multiply the willingness to pay by the inverse of acquiescence.




Finally, we analyze the determinants of acquiescence. The literature has often suggested that older people as well as respondents with lower SES have a higher tendency to acquiesce (e.g. Ross/Mirowsky 1984). This expectation is confirmed in model 6 of Table 4, which shows a positive effect for age, increasingly negative effects for the different educational degrees and a statistically significant negative effect for income. Furthermore, our results also indicate that females have a lower tendency to acquiesce. On the macro level the wealth of nations is negatively related to acquiescence. Furthermore, the results also indicate that population density is positively related to acquiescence. This latter finding might be explained by the fact that population density is often high in developing countries; thus, this variable might be associated with aspects of countries' modernization. Finally, the two survey dummies indicate that acquiescence is higher in the WVS and the EVS than in the ISSP. The ICC of the null model indicates that 26 % of the total variance of acquiescence is due to differences between countries. Model 6 explains 82 % of this macro variance and 12 % of the individual variance. Overall, the results depicted in Table 4 suggest that the negative association between wealth and acquiescence (model 6) on the one hand and the positive association between acquiescence and the willingness to pay (models 4 and 5) on the other hand are responsible for the suppression of the effect of wealth on the willingness to pay if acquiescence is not controlled for. In that case the regression analysis does not distinguish between the effect of wealth and the effect of acquiescence on environmental concern, and both effects cancel each other out. Next to wealth, acquiescence also depends on population density. Hence, it seems that poor countries with high population density are most susceptible to acquiescence.

5 Conclusions and Discussion

This study analyzes the question of why different studies scrutinizing the determinants of environmental concern in cross-cultural perspective come to different conclusions. Studies that are based on data from the ISSP support the wealth effect (Diekmann/Franzen 1999; Franzen 2003; Franzen/Meyer 2010). However, Dunlap and York (2008) as well as Gelissen (2007) argued that environmental concern is stronger in poorer nations and that empirical investigations based on the WVS and the EVS refute the wealth effect. In this contribution we measure environmental concern by two items which ask respondents whether they would be willing to pay higher prices and higher taxes in order to improve the environment. These two items are contained in an almost identical way in all three surveys and therefore allow a comparison of the three surveys. We first analyze the three data sets separately and replicate former findings. Thus, the analysis of the ISSP results in a positive correlation between countries' wealth and inhabitants' environmental concern. Analysis of the WVS results in a negative association between wealth and environmental concern, and the EVS shows a correlation of zero. However, the three data sets differ strongly with respect to respondents' level of willingness to pay. It is comparatively high in the WVS and the EVS and relatively low in the ISSP. On the one side this difference is due to a slight variation in the answering scales of the surveys (four-point versus five-point agreement scales). On the other side there are also large differences between the samples of countries in each survey. We therefore calculated the general tendency of respondents to agree to all kinds of different items contained in the surveys. If this general acquiescence is taken into consideration, the analysis of the pooled data of 60 countries shows a positive and statistically significant


correlation between the countries' wealth and their environmental concern. This fundamental result is robust when we apply multilevel analysis to the data and take further individual and country-level effects into consideration. On the individual level respondents' relative income position, their education and age affect the willingness to pay. At the macro level the willingness to pay is also determined by the wealth of nations but not by environmental quality, population density or the inequality of the income distribution. Thus, our analyses of the pooled data from the ISSP, WVS, and EVS support the wealth hypothesis and refute the conclusions of Dunlap and York (2008). The puzzle of the contradictory findings is resolved when the countries' acquiescence is incorporated into the analysis. Respondents in poorer nations in Asia and Eastern Europe have a stronger tendency to agree to survey questions. This tendency of general agreement can also be observed in industrialized countries, however, to a lesser extent. The level of acquiescence we measured may be due to different causes: on the one side it is related to using four- or five-point Likert scales. On the other side, our findings suggest that acquiescence also depends on the wealth of nations. Moreover, there might be cultural effects such as differences in individualistic or collectivistic orientations. Unfortunately, the data we use do not contain proper measures to investigate this third possibility. Finally, one note of caution is appropriate. Even though we analyze answers of participants from 60 countries, these countries are not a random sample of all countries but an opportunity sample. It will therefore be left to further research to show whether the results remain robust if more countries participate in the WVS or the ISSP.




References

Blocker, J.T., D.L. Eckberg (1997), Gender and Environmentalism: Results from the 1993 General Social Survey. Social Science Quarterly 78: 841-858.
Bosau, Ch. (2009), Arbeitszufriedenheitsmessungen im interkulturellen Vergleich. Dissertation. Universität zu Köln.
Diekmann, A. (2010), Empirische Sozialforschung. (Fourth Edition). Reinbek: Rowohlt.
Diekmann, A., A. Franzen (1999), The Wealth of Nations and Environmental Concern. Environment and Behavior 31: 540-549.
Dunlap, R., R. York (2008), The Globalization of Environmental Concern and the Limits of the Postmaterialist Values Explanation: Evidence from Four Multinational Surveys. The Sociological Quarterly 49: 529-563.
Franzen, A. (2003), Environmental Attitudes in International Comparison: An Analysis of the ISSP Surveys 1993 and 2000. Social Science Quarterly 84: 297-308.
Franzen, A., R. Meyer (2004), Klimawandel des Umweltbewusstseins? Analysen mit dem ISSP 2000. Zeitschrift für Soziologie 33: 119-137.
Franzen, A., R. Meyer (2010), Environmental Attitudes in Cross-National Perspective: A Multilevel Analysis of the ISSP 1993 and 2000. European Sociological Review 26: 219-234.
Franzen, A., D. Vogl (2011), The Willingness to Pay for Environmental Protection: A Comparison of the ISSP, WVS, and EVS. Forthcoming.
Gelissen, J. (2007), Explaining Popular Support for Environmental Protection: A Multilevel Analysis of 50 Nations. Environment and Behavior 39: 392-415.
Grimm, St.D., T.A. Church (1999), A Cross-Cultural Study of Response Biases in Personality Measures. Journal of Research in Personality 33: 415-441.
Hofstede, G.H. (1980), Culture's Consequences: International Differences in Work-Related Values. Newbury Park, CA: Sage Publications.
Hofstede, G.H. (2001), Culture's Consequences: Comparing Values, Behaviors, Institutions and Organizations Across Nations. Thousand Oaks, CA: Sage Publications.
Marin, G., R.J. Gamba, B.V. Marin (1992), Extreme Response Style and Acquiescence among Hispanics. Journal of Cross-Cultural Psychology 23: 498-509.
Matsumoto, D., S.H. Yoo (2006), Toward a New Generation of Cross-Cultural Research. Perspectives on Psychological Science 1: 234-250.
Rabe-Hesketh, S., A. Skrondal (2008), Multilevel and Longitudinal Modeling Using Stata. (Second Edition). College Station, TX: Stata Press.
Ross, C.E., J. Mirowsky (1984), Socially-Desirable Response and Acquiescence in a Cross-Cultural Survey of Mental Health. Journal of Health and Social Behavior 25: 189-197.
Schaeffer, N.C., St. Presser (2003), The Science of Asking Questions. Annual Review of Sociology 29: 65-88.
Schnell, R., P. Hill, E. Esser (2005), Methoden der empirischen Sozialforschung. (Seventh Edition). München: Oldenbourg.
Schuman, H., St. Presser (1996), Questions and Answers in Attitude Surveys. Thousand Oaks, CA: Sage.
Smith, P.B. (2004), Acquiescence Response Bias as an Aspect of Cultural Communication Style. Journal of Cross-Cultural Psychology 35: 50-61.
Smith, P.B., R. Fischer (2008), Acquiescence, Extreme Response Bias and Culture: A Multilevel Analysis. Pp. 288-311 in: F. van de Vijver, D.A. van Hemert, Y.H. Poortinga (eds.), Multilevel Analysis of Individuals and Cultures. New York: Erlbaum.
Snijders, T.A., R.J. Bosker (1999), Multilevel Analysis: An Introduction to Basic and Advanced Multilevel Modeling. London: Sage.
van Soest, A., M. Hurd (2008), A Test for Anchoring and Yea-Saying in Experimental Consumption Data. Journal of the American Statistical Association 103: 126-136.
van de Vijver, F., K. Leung (1997), Methods and Data Analysis for Cross-Cultural Research. Thousand Oaks: Sage Publications.
Wilson, M., M. Daly, St. Gordon, A. Pratt (1996), Sex Differences in Valuations of the Environment? Population and Environment 18: 143-159.


Prof. Dr. Axel Franzen, Institute of Sociology, University of Bern, Lerchenweg 36, 3000 Bern 9, Switzerland. [email protected]
Dipl.-Soz. Dominikus Vogl, Institute of Sociology, University of Bern, Lerchenweg 36, 3000 Bern 9, Switzerland. [email protected]

Jahrbücher f. Nationalökonomie u. Statistik (Lucius & Lucius, Stuttgart 2011) Bd. (Vol.) 231/5+6

Buchbesprechungen / Book Reviews

Aoyama, H., Y. Fujiwara, Y. Ikeda, H. Iyetomi, W. Souma, Econophysics and Companies. Statistical Life and Death in Complex Business Networks. Cambridge (Cambridge University Press) 2010, pp. 258, £ 60.00, ISBN 978-0-521-19149-4.

The term Econophysics was (probably) coined by physicist H.E. Stanley more than ten years ago, indicating the area of research where economic problems are analyzed from the perspective of physicists. Whereas for a long time econophysicists focused on financial markets and their empirical regularities, some interest in firms' dynamics has recently emerged. The low-cost availability of large data sets on firms' characteristics, such as their balance sheets, is probably the main motivation for the emergence of such interest among the community of statistical physicists. Large data sets are indeed a prerequisite for identifying robust empirical regularities. Within this emerging trend, the authors nicely illustrate the results of a statistical analysis of a comprehensive data set on Japanese firms (several millions).

I must confess that I was expecting a book made up of a collection of previously published results in an almost unchanged article format - a recent publishing trend in this field. On the contrary, I immediately recognized the effort made by the authors to present self-contained material, with an extensive theoretical part and a large set of empirical results.

This book has six chapters. After an introductory chapter, the second is devoted to theoretical and empirical material on statistical distributions, with emphasis on Pareto distributions; I found the authors' interpretation of the Pareto distribution in terms of economic concepts such as oligopoly and monopoly intriguing. The third chapter illustrates the empirical analysis of firm dynamics, with special focus on the growth-rate distribution of several firm properties (such as sales and profit) and their connection with Gibrat's law. A minor comment: from the viewpoint of an economist, it seems a bit strange to restrict the growth-rate analysis to positive profits only, ignoring the possibility of switching from profits to losses (defined as deficits in the book). The fourth chapter shows many empirical results on several network types, such as shareholder or transaction networks; of special interest is the part illustrating the correlation among different networks. The fifth chapter introduces an agent-based model able to reproduce the previous empirical regularities. Finally, the last chapter is devoted to possible practical applications of the previous concepts and results.

I found the book interesting and, in general, elegantly structured. It is an excellent reference for a physicist willing to start his or her research on firms' dynamics and industrial organization, since it gives a quite complete overview of the relevant results available so far in the literature and illustrates many economically meaningful open questions. I would claim, however, that it is not suitable for researchers trained in 'traditional' or mainstream economics. They are not likely to appreciate the underlying meaning of a statistical analysis oriented towards finding emergent properties under aggregation, and they would miss the consequences of the empirical findings presented in the book.

Castellón (Spain)

Simone Alfarano

Postler, Andreas, Nachhaltige Finanzierung der Gesetzlichen Krankenversicherung. Berlin (Duncker & Humblot) 2010, 245 pp., € 68.00, ISBN 978-3-428-13391-8.

Every citizen knows by now that demographic change confronts the long-term financing of our social insurance systems with major challenges. Politically, however, these challenges have so far hardly been addressed. The exception is the statutory pension insurance, where since 2005 the sustainability factor has slowed the growth of pensions. In health and long-term care insurance, by contrast, action is still needed unless high contribution burdens are to be shifted onto future generations.

In his book, accepted as a doctoral dissertation at the University of Duisburg-Essen, Andreas Postler examines the sustainability of the financing of the German statutory health insurance (GKV). To this end he considers the consequences of demographic change for the revenue and the expenditure side. On the revenue side the effects are quickly outlined: the declining ratio of contributors to pensioners erodes the financing base of the GKV, since pensioners' contributions fall far short of covering their expenditures. The development of the expenditure side is harder to forecast, as it is unclear what expenditures the various age groups will generate in the future. There is agreement that people will probably live longer; whether they will spend the additional years of life in good or in poor health, however, is disputed. According to the medicalization thesis they will spend these years in illness and thus become more expensive for the health care system; according to the more optimistic compression thesis, the years of illness are pushed back towards the end of life.

The book opens with an introduction to the problem and explains key figures of the GKV and of demographic development. This is followed by an analysis of demographic and medical-technological effects on the GKV contribution rate. The scientific core is the fourth chapter, in which simulations are used to analyze how demography and the development of medical technology affect the contribution rate. The author examines various scenarios on the expenditure and the revenue side. Besides the standard approach, which starts from today's expenditure profile, he accounts for death-related costs and considers a rightward shift of the expenditure profile by the projected increase in life expectancy. With life expectancy rising by four years until 2050, this means, for example, that the expenditures of a 67-year-old then correspond to those of a 63-year-old today. On the revenue side, different paths of the pension level are considered. The analysis is complemented by medical-technological progress, modeled as an exogenous increase in expenditures at rates between 0.4 and 1.6% p.a. The author's central result is that the contribution rate rises to 26.9% by 2050 in the best case and to 34.9% in the worst case.

In the fifth chapter the author discusses how funded elements could be introduced into the GKV. Simulations show how both a collective funding scheme with intergenerational redistribution and an individual funding scheme can contribute to smoothing the contribution rate.
Finally, the author advocates a reform model with flat-rate premiums, individual partial funding, and a solidarity-based compensation scheme.

Dissertations are often read only by the referees and a few specialists. Postler's book, by contrast, addresses a broader public. The discussion of the initial situation is quite informative, and the fundamental problems of forecasting the development of health care expenditures are well presented. The simulations conducted by the author, however, should be taken with caution. He dispenses with econometric analyses of his own and thus falls short of the standard of the specialist literature. Moreover, one gains the impression that the expenditure effects are overstated and the revenue effects understated. This shows not only in the choice of words (the tragedy, part one and part two) but also in the choice of assumptions: the author assumes constant wages, and the effects of the increase in the statutory retirement age are not considered. In the treatment of medical-technological progress, the choice of the 'best-case' scenario moreover appears arbitrary; his own analyses would have suggested a contribution rate of 20.6 rather than 26.9%. Readers who wish to engage more deeply


with the subject must therefore be referred to the relevant contributions in specialist journals. It is surprising that the book never defines the concept of sustainability. Implicitly the author seems to equate it with contribution-rate stability. He mentions that this concept is questionable from a welfare-theoretic point of view, yet refrains from a preference-based analysis; indeed, the view is held in the literature that citizens are willing to spend a larger share of their income on health services in the future. In line with the specialist literature, however, is the author's conclusion that even under favorable circumstances - separate accounting for death-related costs and validity of the compression thesis - no all-clear can be sounded for the future of the GKV.

Universität Hamburg

Mathias Kifmann

Ramser, Hans J., Manfred Stadler (eds.), Marktmacht. Tübingen (Mohr Siebeck) 2010, 285 pp. (Wirtschaftswissenschaftliches Seminar Ottobeuren 39), € 49.00, ISBN 978-3-16-150746-5.

The conference volume Marktmacht (Market Power) collects the papers and discussant comments of the 39th Wirtschaftswissenschaftliches Seminar Ottobeuren, held in 2009. The contributions range from theory and empirical work to questions of practical application and were written both by academics and by practitioners from competition and regulatory authorities. The title is interpreted broadly: the volume covers a wide range of topics in competition policy and regulation, from market delineation and the use of leniency programs as an instrument against cartels to the analysis of investment decisions in liberalized electricity markets. Particularly welcome is the juxtaposition of each paper with its discussant comment. The comments help the reader place the contributions in the scientific literature, present complementary aspects, and point out the limits of what the papers can establish.

An introductory overview is followed by Stadler's theoretical essay Market Structure, Spillovers and Licensing in R&D Contests, in which he discusses the relationship between market concentration and innovation activity, explicitly taking into account input and output spillovers as well as firms' opportunities for technological development. For a given level of spillovers, Stadler derives an inverted U-shaped relationship between market concentration and innovation activity in the market. While high market shares are assumed to favor the development of new technologies, input and output spillovers weaken the incentives to invest in research and development as competition intensifies.

In their empirical article Technologieakquisition und Marktstellung, Czarnitzki and Kraft use a Tobit estimation to examine how various determinants of market structure (e.g. the HHI and the importance of imports and exports) and firm-level variables (e.g. number of employees and age) affect firms' expenditures on licenses and similar usage rights, i.e. their innovation behavior. The study is based on data from the Mannheim Innovation Panel. The authors show in particular that established firms spend more on licenses than firms planning to enter a market. These results support a theoretical model by Gilbert and Newbery (1982). They contrast, however, with the patent-race models frequently discussed in the literature, in which potential entrants spend more on R&D projects when these yield (drastic) innovations.

In their contribution Endogenous Merger Formation and Welfare in Asymmetric Markets, Stadler and Neubecker examine the private profitability and the aggregate welfare effects of mergers in a Cournot game with three or four firms


producing a homogeneous good. The firms' production structure is characterized by the absence of fixed costs and by asymmetric marginal costs. The authors conclude that no mergers take place either when cost asymmetries are small, owing to the merger paradox (Salant et al. 1983), or when they are very pronounced; in the latter case the least efficient firm commands only a small market share and is therefore unattractive as a merger partner. With moderately asymmetric costs, by contrast, mergers are privately profitable and, since they eliminate inefficient production structures, can also be judged positively in welfare terms. Competition authorities should therefore give particular attention to an efficiency defense in mergers of this type. It is regrettable that the contribution confines itself to a theoretical treatment and merely names, in its conclusion, the crude oil market as an example of an industry to which the model could be applied. A further investigation of which real-world industries satisfy the model's partly strong assumptions would have been of interest for practical competition policy.

In Legal and Illegal Cartels in Germany between 1958 and 2004, Haucap, Heimeshoff and Schultz describe a new data set compiling information on 864 legal and 95 illegal German cartels from the period 1958 to 2004. The legal cartels are exemptions from the cartel prohibition of § 1 GWB, which the Bundeskartellamt could authorize under §§ 2-7 GWB until 2004. The data set contains, among other things, variables on the identity and number of participating firms, the industry concerned, and the duration and type of the cartel. It turns out that, among both illegal and legal cartels, agreements in the construction industry and its upstream sectors play a prominent role, accounting for up to 40% of the observed cartels. First econometric analyses indicate that cartels with more than 12 or fewer than 5 members in particular tend to be short-lived, i.e. unstable. More detailed analyses should follow. To express a pious wish: the scientific community would certainly welcome a publication of this data set.

Schwalbe contributes a survey of the economic foundations of leniency programs as an instrument against cartels. He briefly presents the historical development and the current state of leniency programs in the USA, Europe and Germany and adds current issues such as the treatment of ringleaders, the question of a competition authority inducing the leniency applicant to remain in the cartel in order to gather further evidence, and the relationship between leniency programs and actions for damages. The emphasis of the paper lies on a verbal summary of the relevant economic-theoretical literature on leniency programs, from Motta and Polo (1999) to Harrington (2008). The overview addresses both students of economics and practitioners looking for a starting point from which to work into the topic.
It is a welcome feature that this theoretical overview is complemented by a summary of empirical and experimental evidence, mainly from 2005 onwards. Also worth reading is Neus' discussant comment, which sets out the effects of fines and leniency programs on cartel stability. He does not confine himself to the 'usual' model of the infinite supergame; rather, he shows that leniency programs also destabilize cartels in finite repeated games with type uncertainty.

C.C. von Weizsäcker uses his article on the concept of the relevant market for establishing market power to present some interesting points of discussion on market delineation. He extends the notion of substitution chains to two-dimensional geographic space and addresses the role of market asymmetries in determining the relevant market. Stahl's discussant comment underlines the nature of this contribution as food for thought: Stahl argues that von Weizsäcker's two-dimensional representation of substitution chains is still too simplified for market delineation in merger control, and he questions whether market asymmetries are necessary for the applicability of the SSNIP test. Further research on these topics may be expected.


Möschel presents his view of the more economic approach in proceedings concerning the abuse of a dominant position. In essence he argues that the economic approach to these questions is as old as competition law itself and that what is new lies merely in the increase of the 'more'. Oppenländer disputes this assessment, stressing that the more economic approach marks a fundamental shift in European competition policy and that the sobering experiences with the approach reported by Möschel are rather due to its short period of application so far. The positions taken in the paper and the comment nicely illustrate Ewald's (2011) demand that, with respect to the more economic approach, economics owes a more intensive discussion of good and bad economic analysis, while the legal side owes the development of a more refined understanding of the methods of competition economics.

In a contribution as concise as it is informative, Röller presents an overview of the practical importance of efficiency effects in merger control. Drawing on 37 Phase II decisions published by the European Commission between 2004 and 2009, he shows that (dynamic) fixed-cost efficiencies were claimed in six of these cases, while in five of the six, (static) efficiencies with respect to variable costs were claimed in addition. From the observation that the Commission accepted the claim of realizable efficiencies in only two cases, in which they were not even decisive for the outcome, Röller concludes that efficiency effects have not played a prominent role in European merger control since 2004.

In his article Regulatory Risk and Optimal Incentive Regulation: The Two Type Case, Strausz examines the effects of uncertainty about changes in regulatory measures (regulatory risk) on the profits and investments of regulated firms. On the basis of an Armstrong-Sappington model (2008) he analyzes in particular the consequences of curvature in demand. For convex demand curves Strausz shows, contrary to the prevailing view in the literature, that regulatory risk can raise expected firm profits and thus encourage investment; for concave demand curves, by contrast, regulatory risk discourages investment. As Alós-Ferrer emphasizes in his discussant comment, the contribution is thus to be read as a call to estimate demand functions in regulated markets empirically.

With their theoretical paper Investment Decisions in Liberalized Electricity Markets: The Impact of Market Design, Grimm and Zöttl show that a sensible design of emissions trading can stimulate investment by electricity suppliers with market power while at the same time promoting the reduction of CO2 emissions. To achieve these goals, regulators have two main design options: the choice of the emissions target and the choice of the allocation mechanism for the certificates, for example an auction or a (partially) free allocation (grandfathering). To strengthen investment incentives, the authors propose that part of the certificates be allocated to electricity producers free of charge.
With respect to environmental goals, they further recommend a disproportionately large allocation to firms that already produce with well below-average emissions. It remains questionable, however, whether such strategies can be implemented within existing regulatory frameworks.

Henseler-Unger discusses the relationship between regulation, infrastructure investment and innovation in telecommunications markets from the perspective of the Bundesnetzagentur. Her descriptive contribution spans the legal framework within which regulators operate, the division of responsibilities between the Bundesnetzagentur and the Bundeskartellamt, and the question of how far the regulator has contributed, through flexible regulatory measures, to raising dynamic and static efficiency in telecommunications markets. For Henseler-Unger there is no contradiction between regulation and innovation: regulation rather promotes intra- and intermodal competition and thus investment in next generation access networks, allowing the regulator to 'lean back further'.


Kaiser, Mendez and Rønde empirically examine the effects of the change in the Danish system for calculating reference prices for lipid-lowering drugs. Before April 1, 2005, reference prices were calculated as average prices of comparable products in Europe; since then, the cheapest product within the comparison group has served as the reference. By means of an event study the authors identify a reduction in sales prices.

In sum, the essay collection Marktmacht offers a set of readable and thought-provoking contributions on this topic from the perspective of competition and regulatory policy. The volume is not conceived as a textbook treatment of market power and related questions, although some contributions on particular aspects of this broad topic can serve as introductory reading. These stand alongside essays, such as the presentation of a new data set on cartels and the account of how efficiency effects are treated in European merger control, that are well suited to shape the further discussion of their respective topics.

References

Armstrong, M., D.E.M. Sappington (2008), Recent Developments in the Theory of Regulation. Pp. 1557-1700 in: M. Armstrong, D.E.M. Sappington (eds.), Handbook of Industrial Organization, Vol. 3. New York.
Ewald, C. (2011), Ökonomie im Kartellrecht: Vom more economic approach zu sachgerechten Standards forensischer Ökonomie. Zeitschrift für Wettbewerbsrecht 9: 15-47.
Gilbert, R., D.M.G. Newbery (1982), Preemptive Patenting and the Persistence of Monopoly. American Economic Review 72: 514-526.
Harrington, J.E. (2008), Optimal Corporate Leniency Programs. Journal of Industrial Economics 56: 215-246.
Motta, M., M. Polo (1999), Leniency Programs and Cartel Prosecution. European University Institute Working Paper ECO No. 99/23: 1-36.
Salant, S., S. Switzer, R. Reynolds (1983), Losses Due to Merger: The Effects of an Exogenous Change in Industry Structure on Cournot-Nash Equilibrium. Quarterly Journal of Economics 98: 185-200.

Justus-Liebig-Universität Gießen

Tim Brühn and Johannes Paha


Jahrbücher für Nationalökonomie und Statistik / Journal of Economics and Statistics

Founded by Bruno Hildebrand

Continued by Johannes Conrad, Ludwig Elster, Otto v. Zwiedineck-Südenhorst, Gerhard Albrecht, Friedrich Lütge, Erich Preiser, Knut Borchardt, Alfred E. Ott and Adolf Wagner

Edited by Peter Winker, Wolfgang Franz, Werner Smolny, Peter Stahlecker, Adolf Wagner, Joachim Wagner

Volume 231

Lucius & Lucius · Stuttgart 2011

Jahrbücher f. Nationalökonomie u. Statistik (Lucius & Lucius, Stuttgart 2011) Bd. (Vol.) 231/5+6

Inhalt des 231. Bandes / Contents of Volume 231

Abhandlungen / Original Papers
Auspurg, Katrin, Thomas Hinz, What Fuels Publication Bias? Theoretical and Empirical Analyses of Risk Factors Using the Caliper Test 636-660
Bauer, Johannes, Jochen Gross, Difficulties Detecting Fraud? The Use of Benford's Law on Regression Tables 733-748
Bertoli, Simone, Herbert Brücker, Extending the Case for a Beneficial Brain Drain 466-478
Böhme, Enrico, Christopher Müller, Searching for the Concentration-Price Effect in the German Movie Theater Industry 479-493
Bönke, Timm, Carsten Schröder, Poverty in Germany - Statistical Inference and Decomposition 178-209
Boneberg, Franziska, The Economic Consequences of One-third Co-determination in German Supervisory Boards 440-457
Braakmann, Nils, Joachim Wagner, Product Diversification and Profitability in German Manufacturing Firms 326-335
Carstensen, Kai, Klaus Wohlrabe, Christina Ziegler, Predictive Ability of Business Cycle Indicators under Test 82-106
Coutts†, Elisabeth, Ben Jann, Ivar Krumpal, Anatol-Fiete Näher, Plagiarism in Student Papers: Prevalence Estimates Using Special Techniques for Sensitive Questions 749-760
Diekmann, Andreas, Are Most Published Research Findings False? 628-635
Döhrn, Roland, Christoph M. Schmidt, Information or Institution? On the Determinants of Forecast Accuracy 9-27
Franzen, Axel, Dominikus Vogl, Pitfalls of International Comparative Research: Taking Acquiescence into Account 761-782
Greiser, Eberhard, One-eyed Epidemiologic Dummies at Nuclear Power Plants. A Reply to Walter Krämer and Gerhard Arminger's Paper '"True Believers" or Numerical Terrorism at the Nuclear Power Plant' 621-627
Halbleib, Roxana, Valeri Voev, Forecasting Multivariate Volatility using the VARFIMA Model on Realized Covariance Cholesky Factors 134-152
Heer, Burkhard, Alfred Maußner, Value Function Iteration as a Solution Method for the Ramsey Model 494-515
Heuson, Clemens, Purchasing-power-dependent Preferences as a New Explanation of Giffen Behaviour: A Note 516-521
Hofer, Helmut, Torsten Schmidt, Klaus Weyerstrass, Practice and Prospects of Medium-term Economic Forecasting 153-171
Hohendanner, Christian, Ein-Euro-Jobs und reguläre Beschäftigung / One-Euro-Jobs and Regular Employment 210-246
Jeßberger, Christoph, Maximilian Sindram, Markus Zimmer, Global Warming Induced Water-Cycle Changes and Industrial Production - A Scenario Analysis for the Upper Danube River Basin 415-439
Kappler, Marcus, Business Cycle Co-movement and Trade Intensity in the Euro Area: Is there a Dynamic Link? 247-265
Krämer, Walter, Gerhard Arminger, "True Believers" or Numerical Terrorism at the Nuclear Power Plant 608-620
Krieger, Tim, Stefan Traub, Wie hat sich die intragenerationale Umverteilung in der staatlichen Säule des Rentensystems verändert? / Has Intragenerational Redistribution Become Less Important in Pension Systems' Public Pillar? 266-287
Krüger, Fabian, Frieder Mokinski, Winfried Pohlmeier, Combining Survey Forecasts and Time Series Models: The Case of the Euribor 63-81
Lenza, Michele, Thomas Warmedinger, A Factor Model for Euro-area Short-term Inflation Analysis 50-62
Lütkepohl, Helmut, Forecasting Nonlinear Aggregates and Aggregates with Time-varying Weights 107-133
Opp, Karl-Dieter, The Production of Historical "Facts": How the Wrong Number of Participants in the Leipzig Monday Demonstration on October 9, 1989 Became a Convention 598-607
Petrick, Sebastian, Katrin Rehdanz, Ulrich J. Wagner, Energy Use Patterns in German Industry: Evidence from Plant-level Data 379-414
Röder, Norbert, Stefan Kilian, Which Parameters Determine the Development of Farm Numbers in Germany? 358-378
Schiersch, Alexander, Jens Schmidt-Ehmcke, Is the Boone-Indicator Applicable? Evidence from a Combined Data Set of German Manufacturing Enterprises 336-357
Schräpler, Jörg-Peter, Benford's Law as an Instrument for Fraud Detection in Surveys Using the Data of the Socio-Economic Panel (SOEP) 685-718
Schumacher, Christian, Forecasting with Factor Models Estimated on Large Datasets: A Review of the Recent Literature and Evidence for German GDP 28-49
Shikano, Susumu, Verena Mack, When Does the Second-Digit Benford's Law-Test Signal an Election Fraud? Facts or Misleading Test Results 719-732
Stahn, Kerstin, Changes in Import Pricing Behaviour: Evidence for Germany 546-557
Verardi, Vincenzo, Joachim Wagner, Robust Estimation of Linear Fixed Effects Panel Data Models with an Application to the Exporter Productivity Premium 522-545
Weiß, Bernd, Michael Wagner, The Identification and Prevention of Publication Bias in the Social Sciences and Economics 661-684

Diskussionsbeiträge / Discussion Papers
Meyer, Dirk, Kosten des Europäischen Finanzstabilisierungsmechanismus (EFSM) aus deutscher Sicht / The Costs of the European Financial Stability Facility (EFSF) - The German Point of View 288-303
Schöbel, Enrico, Finanzverwaltung von innen: Neue Ansätze ihrer empirisch-ökonomischen Erforschung 558-571

Literaturbeiträge / Review Papers
Wagner, Adolf, Fortgeschrittene Evolutorische Ökonomik / Advanced Evolutionary Economics 304-313


Buchbesprechungen / Book Reviews
Aoyama, H., Y. Fujiwara, Y. Ikeda, H. Iyetomi, W. Souma, Econophysics and Companies. Statistical Life and Death in Complex Business Networks 783
Bourg, Jean-François, Jean-Jacques Gouguet, The Political Economy of Professional Sport 458
de Bandt, Olivier, Thomas Knetsch, Juan Peñalosa, Francesco Zollino (eds.), Housing Markets in Europe - A Macroeconomic Perspective 572
Emmett, Ross B. (ed.), The Elgar Companion to the Chicago School of Economics 574
Klein, Michael W., Jay C. Shambaugh, Exchange Rate Regimes in the Modern Era 459
Konrad, Kai A., Tim Lohse (Hrsg.), Einnahmen- und Steuerpolitik in Europa: Herausforderungen und Chancen 576
Konrad, Kai A., Holger Zschäpitz, Schulden ohne Sühne? Warum der Absturz der Staatsfinanzen uns alle trifft 580
Mateus, Abel M., Teresa Moreira (eds.), Competition Law and Economics: Advances in Competition Policy Enforcement in the EU and North America 582
Nolte, Sandra, Measurement Errors in Nonlinear Models 583
Ohr, Renate (Hrsg.), Governance in der Wirtschaftspolitik 584
Pies, Ingo, Martin Leschke (Hrsg.), Douglass Norths ökonomische Theorie der Geschichte 314
Postler, Andreas, Nachhaltige Finanzierung der Gesetzlichen Krankenversicherung 784
Ramser, Hans J., Manfred Stadler (Hrsg.), Marktmacht 785
Schöbel, Enrico, Steuerehrlichkeit - Eine politisch-ökonomische und zugleich finanzsoziologische Analyse der Einkommensteuerrechtsanwendung und -befolgung in Deutschland 316
Skedinger, Per, Employment Protection Legislation - Evolution, Effects, Winners and Losers 172
Somaggio, Gabriele, Start mit Hindernissen. Eine theoretische und empirische Analyse der Ursachen von Arbeitslosigkeit nach der dualen Berufsausbildung 587
Theurl, Theresia (Hrsg.), Wirtschaftspolitische Konsequenzen der Finanz- und Wirtschaftskrise 588
Vogel, Harold L., Financial Market Bubbles and Crashes 173
Weiß, Mirko, Zur Geldpolitik im Euro-Währungsraum: Beschreibung, Auswirkung und Ursachenanalyse von Inflationsunterschieden 462
Wickström, Bengt-Arne (Hrsg.), Finanzpolitik und Unternehmensentscheidung 175


The Reviewers for Volume 231 (2011) of the Jahrbücher für Nationalökonomie und Statistik (01.01.2011 to 31.10.2011)

On behalf of the editors I thank all scholars who were willing to referee manuscripts for the Jahrbücher für Nationalökonomie und Statistik during this period. With their help we have come quite close to our goal of reaching a decision on submitted manuscripts as quickly as possible. The authors were able to take up the detailed suggestions for improvement, and the quality of the manuscripts has benefited greatly from them.

Peter Winker

Adams, Michael, Universität Hamburg
Balázs, Egert, OECD, Paris
Bauer, Johannes, LMU München
Baumgärtner, Stefan, Leuphana Universität Lüneburg
Belke, Ansgar, Universität Duisburg-Essen
Berger, Roger, Universität Leipzig
Bode, Eckhardt, IfW Kiel
Boenke, Timm, Freie Universität Berlin
Bohl, Martin, Westfälische Wilhelms-Universität Münster
Bortis, Heinrich, Universität Freiburg
Braakmann, Nils, Newcastle Business School
Bräuninger, Michael, HWWI Hamburg
Bredl, Sebastian, Universität Gießen
Breitung, Jörg, Universität Bonn
Breustedt, Gunnar, Universität Kiel
Bruderer-Enzler, Heidi, ETH Zürich
Brüderl, Josef, Universität Mannheim
Brülhart, Marius, HEC Lausanne
Brunner, Johannes, Johannes Kepler Universität Linz
Büttner, Thiess, Universität Erlangen-Nürnberg
Cizek, Pavel, Tilburg University
Cleveland, Gordon H., University of Toronto
Clots-Figueras, Irma, Universidad Carlos III de Madrid
Corrado, Carol A., The Conference Board, New York
Croux, Christophe, ORSTAT, Katholieke Universiteit Leuven
D'Agostino, Antonello, European Central Bank
Drechsel, Katja, IWH Halle
Dreger, Christian, DIW Berlin
Drichoutis, Andreas, University of Ioannina
Eickelpasch, Alexander, DIW Berlin
Fendel, Ralf, WHU, Otto Beisheim School of Management, Vallendar
Fischer, Andreas, Swiss National Bank, Zürich
Fischer, Christoph, Deutsche Bundesbank
Fisher, Eric, California Polytechnic State University
Flaig, Gebhard, LMU München
Franzen, Axel, Universität Bern
Frick, Bernd, Universität Paderborn
Fritsche, Ulrich, Universität Hamburg
Fritzer, Friedrich, OeNB, Wien

Görlitz, Katja, RWI Essen
Gross, Jochen, LMU München
Halvorsen, Elin, Statistics Norway, Oslo
Hautsch, Nikolaus, Humboldt-Universität zu Berlin
Hefeker, Carsten, Universität Siegen
Heilemann, Ullrich, Universität Leipzig
Heinbach, Wolf-Dieter, Ministerium für Wissenschaft, Forschung und Kunst Baden-Württemberg
Heineck, Guido, Universität Bamberg
Henke, Klaus-Dirk, Technische Universität Berlin
Hillebrand, Marten, KIT Karlsruhe
Hinz, Thomas, Universität Konstanz
Höglinger, Marc, ETH Zürich
Hogrefe, Jens, IfW Kiel
Holm, Hakan Jerker, Lund University
Hübler, Olaf, Leibniz Universität Hannover
Jann, Ben, Universität Bern
Janz, Norbert, FH Aachen
Jensen, Uwe, Universität Kiel
Jochem, Axel, Deutsche Bundesbank, Frankfurt a.M.
Jung, Benjamin, Universität Hohenheim
Jüßen, Falko, Technische Universität Dortmund
Kaiser, Ulrich, Universität Zürich
Kaplan, Todd R., University of Exeter
Kappler, Marcus, ZEW Mannheim
Ketzel, Eberhardt, Sankt Augustin
Kirstein, Roland, Otto-von-Guericke-Universität Magdeburg
Knabe, Andreas, Freie Universität Berlin
Kölling, Arnd, Hochschule der Bundesagentur für Arbeit, Schwerin
Königstein, Manfred, Universität Erfurt
Krämer, Walter, Universität Dortmund
Kreyenfeld, Michaela, Universität Rostock
Lenza, Michele, EZB, Frankfurt a.M.
Liebe, Ulf, Universität Göttingen
List, John A., University of Chicago
Manganelli, Simone, EZB, Frankfurt a.M.
Maussner, Alfred, Universität Augsburg
Meckl, Jürgen, Universität Gießen
Mellander, Erik, Institute for Labour Market Policy Evaluation, Uppsala
Meyer, Mark, GWS Osnabrück


Mühlbacher, Axel, Hochschule Neubrandenburg
Murphy, Ryan, ETH Zürich
Neels, Karel, University of Antwerp
Neubauer, Günter, IFG Inst. f. Gesundheitsökonomik GbR, München
Neusser, Klaus, Universität Bern
Niebuhr, Annekatrin, Universität Kiel
Nolte, Ingmar, Warwick Business School
Norton, Edward C., University of Michigan
Osborn, Denise, University of Manchester
Osterloh, Steffen, ZEW Mannheim
Pérez-Luño, Ana, Universidad Pablo de Olavide, Sevilla
Peters, Heiko, Statistisches Bundesamt, Wiesbaden
Petrick, Sebastian, IfW Kiel
Pfaffermayr, Michael, Universität Innsbruck
Pfajfar, Damjan, Tilburg University
Pohlmeier, Winfried, Universität Konstanz
Prantl, Susanne, Universität Köln
Przepiorka, Wojtek, Nuffield College, Oxford
Pyka, Andreas, Universität Hohenheim
Ragnitz, Joachim, ifo Dresden
Rahmeyer, Fritz, Universität Augsburg
Ramser, Hans Jürgen, Universität Konstanz
Rippin, Franziska, LSKN Niedersachsen, Hannover
Rizzo, Leonzio, Università di Ferrara
Rodríguez-Pose, Andrés, London School of Economics
Rondinelli, Concetta, Bank of Italy, Rome
Rünstler, Gerhard, Österreichisches Institut für Wirtschaftsforschung
Savin, Ivan, Universität Gießen
Schank, Thorsten, Johannes Gutenberg-Universität Mainz
Scherf, Wolfgang, Universität Gießen
Schiersch, Alexander, DIW Berlin
Schmidt, Christoph, RWI Essen
Schnabel, Claus, Universität Erlangen-Nürnberg
Schnabl, Gunther, Universität Leipzig
Schnellenbach, Jan, Universität Heidelberg
Schräpler, Jörg-Peter, Universität Bochum

Schrimpf, Andreas, BIS Basel
Schumacher, Christian, Deutsche Bundesbank, Frankfurt a.M.
Schwiebert, Jörg, Leibniz Universität Hannover
Shikano, Susumu, Universität Konstanz
Siedler, Thomas, DIW Berlin
Siliverstovs, Boriss, KOF, ETH Zürich
Spieß, Katharina, Freie Universität Berlin
Stadtmann, Georg, Europa-Universität Viadrina, Frankfurt (Oder)
Stephan, Andreas, Jönköping International Business School
Stevens, Arnoud, Universität Gent
Tamm, Marcus, RWI Essen
Thomsen, Stephan, Otto-von-Guericke-Universität Magdeburg
Thöni, Christian, Universität St. Gallen
Tillmann, Peter, Universität Gießen
Trappe, Heike, Universität Rostock
Vieth, Manuela, ETH Zürich
Vogel, Alexander, Statistik Nord, Kiel
Vogel, Edgar, MEA Mannheim
Waelbroeck, Patrick, École nationale supérieure des télécommunications, Paris
Wagner, Michael, Universität Köln
Welcker, Johannes, Universität des Saarlandes
Westerheide, Peter, ZEW Mannheim
Witt, Ulrich, Max-Planck-Institut für Ökonomik, Jena
Wolf, Elke, Hochschule München
Wolf, Nikolaus, Humboldt-Universität zu Berlin
Wolff, Joachim, IAB Nürnberg
Wrede, Matthias, Universität Erlangen-Nürnberg
Wrohlich, Katharina, DIW Berlin
Ziegler, Andreas, Universität Kassel
Zimmer, Markus, ifo München
Zimmermann, Volker, KfW Bankengruppe, Frankfurt a.M.
Zweimüller, Martina, Johannes Kepler Universität Linz
Zwick, Thomas, LMU München