English Pages 492 [509] Year 2012
FIELD EXPERIMENTS
Design, Analysis, and Interpretation
Alan S. Gerber
Yale University
Donald P. Green
Columbia University
W. W. Norton & Company
New York • London
W. W. Norton & Company has been independent since its founding in 1923, when William Warder Norton and Mary D. Herter Norton first published lectures delivered at the People’s Institute, the adult education division of New York City's Cooper Union. The firm soon expanded its program beyond the Institute, publishing books by celebrated academics from America and abroad. By midcentury, the two major pillars of Norton's publishing program, trade books and college texts, were firmly established. In the 1950s, the Norton family transferred control of the company to its employees, and today, with a staff of four hundred and a comparable number of trade, college, and professional titles published each year, W. W. Norton & Company stands as the largest and oldest publishing house owned wholly by its employees.
Editor: Ann Shin
Associate Editor: Jake Schindel
Project Editor: Jack Borrebach
Marketing Manager, political science: Sasha Levitt
Production Manager: Eric Pier-Hocking
Text Design: Joan Greenfield / Gooddesign Resource
Design Director: Hope Miller Goodell
Composition by Jouve International, Brattleboro, VT
Manufacturing by the Maple Press, York, PA
Copyright © 2012 by W. W. Norton & Company, Inc.
All rights reserved.
Printed in the United States of America.
First edition.
Library of Congress Cataloging-in-Publication Data
Gerber, Alan S.
Field experiments : design, analysis, and interpretation / Alan S. Gerber, Donald P. Green. -- 1st ed.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-393-97995-4 (pbk.)
1. Political science--Research--Methodology. 2. Social science--Research--Methodology. 3. Political science--Study and teaching (Higher) 4. Social science--Study and teaching (Higher) I. Green, Donald P., 1961- II. Title.
JA86.G36 2012
001.4'34--dc23
2011052337
W. W. Norton & Company, Inc., 500 Fifth Avenue, New York, NY 10110-0017
wwnorton.com
W. W. Norton & Company Ltd., 15 Carlisle Street, London W1D 3BS
7 8 9 0
THIS BOOK IS DEDICATED TO OUR PARENTS, WHO HELPED INSTILL IN US A LOVE OF SCIENCE.
CONTENTS
PREFACE
CHAPTER 1 Introduction
1.1 Drawing Inferences from Intuitions, Anecdotes, and Correlations
1.2 Experiments as a Solution to the Problem of Unobserved Confounders
1.3 Experiments as Fair Tests
1.4 Field Experiments
1.5 Advantages and Disadvantages of Experimenting in Real-World Settings
1.6 Naturally Occurring Experiments and Quasi-Experiments
1.7 Plan of the Book
Suggested Readings
Exercises
CHAPTER 2 Causal Inference and Experimentation
2.1 Potential Outcomes
2.2 Average Treatment Effects
2.3 Random Sampling and Expectations
2.4 Random Assignment and Unbiased Inference
2.5 The Mechanics of Random Assignment
2.6 The Threat of Selection Bias When Random Assignment Is Not Used
2.7 Two Core Assumptions about Potential Outcomes
2.7.1 Excludability
2.7.2 Non-Interference
Summary
Suggested Readings
Exercises
CHAPTER 3 Sampling Distributions, Statistical Inference, and Hypothesis Testing
3.1 Sampling Distributions
3.2 The Standard Error as a Measure of Uncertainty
3.3 Estimating Sampling Variability
3.4 Hypothesis Testing
3.5 Confidence Intervals
3.6 Sampling Distributions for Experiments That Use Block or Cluster Random Assignment
3.6.1 Block Random Assignment
3.6.1.1 Matched Pair Design
3.6.1.2 Summary of the Advantages and Disadvantages of Blocking
3.6.2 Cluster Random Assignment
Summary
Suggested Readings
Exercises
Appendix 3.1: Power
CHAPTER 4 Using Covariates in Experimental Design and Analysis
4.1 Using Covariates to Rescale Outcomes
4.2 Adjusting for Covariates Using Regression
4.3 Covariate Imbalance and the Detection of Administrative Errors
4.4 Blocked Randomization and Covariate Adjustment
4.5 Analysis of Block Randomized Experiments with Treatment Probabilities That Vary by Block
Summary
Suggested Readings
Exercises
CHAPTER 5 One-Sided Noncompliance
5.1 New Definitions and Assumptions
5.2 Defining Causal Effects for the Case of One-Sided Noncompliance
5.2.1 The Non-Interference Assumption for Experiments That Encounter Noncompliance
5.2.2 The Excludability Assumption for One-Sided Noncompliance
5.3 Average Treatment Effects, Intent-to-Treat Effects, and Complier Average Causal Effects
5.4 Identification of the CACE
5.5 Estimation
5.6 Avoiding Common Mistakes
5.7 Evaluating the Assumptions Required to Identify the CACE
5.7.1 Non-Interference Assumption
5.7.2 Exclusion Restriction
5.8 Statistical Inference
5.9 Designing Experiments in Anticipation of Noncompliance
5.10 Estimating Treatment Effects When Some Subjects Receive “Partial Treatment”
Summary
Suggested Readings
Exercises
CHAPTER 6 Two-Sided Noncompliance
6.1 Two-Sided Noncompliance: New Definitions and Assumptions
6.2 ITT, ITT_D, and CACE under Two-Sided Noncompliance
6.3 A Numerical Illustration of the Role of Monotonicity
6.4 Estimation of the CACE: An Example
6.5 Discussion of Assumptions
6.5.1 Monotonicity
6.5.2 Exclusion Restriction
6.5.3 Random Assignment
6.5.4 Design Suggestions
6.6 Downstream Experimentation
Summary
Suggested Readings
Exercises
CHAPTER 7 Attrition
7.1 Conditions Under Which Attrition Leads to Bias
7.2 Special Forms of Attrition
7.3 Redefining the Estimand When Attrition Is Not a Function of Treatment Assignment
7.4 Placing Bounds on the Average Treatment Effect
7.5 Addressing Attrition: An Empirical Example
7.6 Addressing Attrition with Additional Data Collection
7.7 Two Frequently Asked Questions
Summary
Suggested Readings
Exercises
Appendix 7.1: Optimal Sample Allocation for Second-Round Sampling
CHAPTER 8 Interference between Experimental Units
8.1 Identifying Causal Effects in the Presence of Localized Spillover
8.2 Spatial Spillover
8.2.1 Using Nonexperimental Units to Investigate Spillovers
8.3 An Example of Spatial Spillovers in Two Dimensions
8.4 Within-Subjects Design and Time-Series Experiments
8.5 Waitlist Designs (Also Known as Stepped-Wedge Designs)
Summary
Suggested Readings
Exercises
CHAPTER 9 Heterogeneous Treatment Effects
9.1 Limits to What Experimental Data Tell Us about Treatment Effect Heterogeneity
9.2 Bounding Var(τ_i) and Testing for Heterogeneity
9.3 Two Approaches to the Exploration of Heterogeneity: Covariates and Design
9.3.1 Assessing Treatment-by-Covariate Interactions
9.3.2 Caution Is Required When Interpreting Treatment-by-Covariate Interactions
9.3.3 Assessing Treatment-by-Treatment Interactions
9.4 Using Regression to Model Treatment Effect Heterogeneity
9.5 Automating the Search for Interactions
Summary
Suggested Readings
Exercises
CHAPTER 10 Mediation
10.1 Regression-Based Approaches to Mediation
10.2 Mediation Analysis from a Potential Outcomes Perspective
10.3 Why Experimental Analysis of Mediators Is Challenging
10.4 Ruling Out Mediators?
10.5 What about Experiments That Manipulate the Mediator?
10.6 Implicit Mediation Analysis
Summary
Suggested Readings
Exercises
Appendix 10.1: Treatment Postcards Mailed to Michigan Households
CHAPTER 11 Integration of Research Findings
11.1 Estimation of Population Average Treatment Effects
11.2 A Bayesian Framework for Interpreting Research Findings
11.3 Replication and Integration of Experimental Findings: An Example
11.4 Treatments That Vary in Intensity: Extrapolation and Statistical Modeling
Summary
Suggested Readings
Exercises
CHAPTER 12 Instructive Examples of Experimental Design
12.1 Using Experimental Design to Distinguish between Competing Theories
12.2 Oversampling Subjects Based on Their Anticipated Response to Treatment
12.3 Comprehensive Measurement of Outcomes
12.4 Factorial Design and Special Cases of Non-Interference
12.5 Design and Analysis of Experiments In Which Treatments Vary with Subjects’ Characteristics
12.6 Design and Analysis of Experiments In Which Failure to Receive Treatment Has a Causal Effect
12.7 Addressing Complications Posed by Missing Data
Summary
Suggested Readings
Exercises
CHAPTER 13 Writing a Proposal, Research Report, and Journal Article
13.1 Writing the Proposal
13.2 Writing the Research Report
13.3 Writing the Journal Article
13.4 Archiving Data
Summary
Suggested Readings
Exercises
APPENDIX A Protection of Human Subjects
A.1 Regulatory Guidelines
A.2 Guidelines for Keeping Field Experiments within Regulatory Boundaries
APPENDIX B Suggested Field Experiments for Class Projects
B.1 Crafting Your Own Experiment
B.2 Suggested Experimental Topics for Practicum Exercises
REFERENCES
INDEX
PREFACE
For more than a decade, we have taught a one-semester course on experimental research methods to undergraduate and graduate students in the social sciences. Although readings and discussion sometimes address experiments conducted in the lab, the course focuses primarily on “field” experiments, studies conducted in natural settings in which subjects are allocated randomly to treatment and control groups. Students read research articles that illustrate key principles of experimental design or analysis, and class time is devoted to explaining these principles. Students often find the material engaging and even inspiring, but the fact that they read selections from a broad research literature rather than a textbook means that even very talented students frequently fail to assimilate important terms, concepts, and techniques.

Our aim in writing this book is to provide a systematic introduction to experimentation that also conveys the excitement of encountering and conducting primary research. Each chapter weaves abstract principles together with examples drawn from a wide range of social science disciplines: criminology, economics, education, political science, social psychology, and sociology. The exercises at the end of each chapter invite students to reflect on abstract problems of research design and to analyze data from (or inspired by) important experiments. Our aim is to alert readers to the vast range of experimental applications and opportunities for future investigation.

Developing expertise as an experimental researcher is part technical training and part apprenticeship. The former requires the reader to think about experimentation in abstract terms. What inferences can be drawn from an experiment, and under what conditions might these inferences be jeopardized? Any explanation of abstract principles must inevitably invoke statistical terminology, because the language of statistics brings precision and generality. The presentation in this book presupposes that the reader has at some point taken a one- or two-semester introduction to statistical inference and regression.
Recognizing that the reader's memory of statistical principles may be hazy, the book continually defines, explains, and illustrates. In an effort to make the presentation accessible, we have freely renamed arcane terms of art.1

1 Our aim throughout is to use naming conventions that convey the intuition behind the idea or procedure. The term “external validity,” for example, is replaced by “generalizability.” We also depart from the academic convention of using scholars' names to refer to ideas or procedures. The term “extreme value bounds,” for example, replaces the term “Manski bounds.” References are provided so that originators of key ideas receive appropriate credit.
We have also sidestepped most of the standard formulas used to conduct hypothesis tests and to construct confidence intervals in favor of a unified framework that relies on statistical simulation. This framework not only makes the presentation more systematic, it also makes the book more concise: from a few core principles, one can deduce a large number of design recommendations that would otherwise require hundreds of pages to explicate.

Our years in the classroom suggest that presentation of abstract principles needs to be reinforced by instructive examples and hands-on experience. It is one thing to memorize key assumptions and quite another to be able to recognize which assumptions come into play in a given application. In an effort to develop this skill, the exercises at the end of each chapter introduce a wide array of experiments and invite readers to reflect on issues of design, analysis, and interpretation. Chapter 12 further illustrates principles laid out in earlier chapters by offering a close reading of several important field experiments.

The text and exercises are designed to prepare readers for the challenge of developing and implementing their own experimental projects. In our courses, we require students to conduct their own field experiments because the experience forces them to link statistical concepts to the specifics of their application, imparts valuable lessons about planning and implementation, and makes them more perceptive readers of other researchers’ work. We urge instructors to assign a small-scale field experiment in order to solidify students' understanding of how to frame a testable hypothesis, allocate subjects to experimental conditions, and contend with complications such as attrition or noncompliance.
Although field experiments are sometimes dismissed as prohibitively expensive, difficult, or ethically encumbered, experience shows that a wide variety of field experimental studies may be conducted with limited resources and minimal risk to human subjects. Sample topics include applying for jobs, searching for apartments, asking for assistance, fundraising, tipping, tutoring, dieting, petitioning, advertising, and exercising. In Appendix B we suggest field experiments (and accompanying readings) that students may use as inspiration for term papers or capstone projects.

Although designed for a stand-alone course on experimental research, this book may also be used as a supplementary text for courses on research design, causal inference, or applied statistics. Chapters 1 through 4 provide a concise introduction to core concepts, such as potential outcomes, sampling distributions, and statistical inference. Chapters 5 through 11 cover more advanced topics, such as noncompliance, attrition, interference, mediation, and meta-analysis. In an effort to make the book accessible, each chapter supplies plenty of worked examples; for those seeking additional technical details, we furnish a list of suggested readings at the end of each chapter. Supplementary materials at http://isps.research.yale.edu/FEDAI provide readers with data and computer code to perform all analyses and simulations. The code for all of the book's examples has been written using the free software package R, so that readers from all over the world can use the statistical procedures we demonstrate at no cost. Readers are encouraged to visit the Web site for supplementary materials, updates, and errata.

This book is in many ways a collective undertaking. The data for the examples and exercises were furnished by an extraordinary array of scholars: Kevin Arceneaux, Julia Azari, Marianne Bertrand, Rikhil Bhavnani, Elizabeth Campbell, David Clingingsmith, Sarah Cotterill, Ruth Ditlmann, Pascaline Dupas, Leslie Hough, Peter John, Asim Ijaz Khwaja, Michael Kremer, Paul Lagunes, Sendhil Mullainathan, Karthik Muralidharan, David Nickerson, Ben Olken, Jeffrey Rosen, Venkatesh Sundararaman, Rocio Titiunik, and Ebonya Washington. We thank David Torgerson and Iain Chalmers for suggesting examples from the history of medicine and randomized trials. Several of the chapters draw on our collaborations with colleagues: Christopher Larimer (Chapters 6 and 10),2 Betsy Sinclair and Margaret McConnell (Chapter 8),3 John Bullock and Shang Ha (Chapter 10),4 and Edward Kaplan and Holger Kern (Chapters 5, 9 and 11).5 For comments on chapter drafts, we are grateful to Josh Angrist, Kevin Arceneaux, Noah Buckley, John Bullock, Daniel Butler, Ana De La O, Thad Dunning, Brian Fried, Grant Gordon, Justine Hastings, Susan Hyde, Macartan Humphreys, Edward Kaplan, Jordan Kyle, Paul Lagunes, Malte Lierl, Jason Lyall, Neil Malhotra, David Nickerson, Laura Paler, Elizabeth Levy Paluck, Ben Pasquale, Limor Peer, Kenneth Scheve, Betsy Sinclair, Pavithra Suryanarayan, David Szakonyi, and Lauren Young. Special thanks go to Cyrus Samii and Rocio Titiunik, who provided valuable comments on the entire manuscript.

The authors wish to thank the talented team of researchers who assisted with the preparation of the statistical examples and exercises.
Peter Aronow and Holger Kern made important technical and substantive contributions to every chapter. Peter Aronow, Cyrus Samii, and Neelan Sircar developed the R package that we use throughout the book to conduct randomization inference. Bibliographic research and manuscript preparation benefited enormously from the work of Mary McGrath and Josh Kalla, as well as from Lucas Leemann, Malte Lierl, Arjun Shenoy, and John Williams. Allison Sovey and Paolo Spada assisted in preparing the exercises and solutions. We are grateful to Limor Peer and Alissa Stollwerk, who archived the data and programs featured in the book. The Institution for Social and Policy Studies at Yale University provided generous support for research and data preservation. We thank the editorial team at W. W. Norton, Ann Shin, Jack Borrebach, and Jake Schindel, for their outstanding work. The authors take full credit for errors and oversights.

Finally, we owe a special debt of gratitude to our families, for their support, encouragement, and patience during the long process of writing, re-writing, and re-re-writing.

2 Gerber, Green, and Larimer 2008.
3 Sinclair, McConnell, and Green 2012.
4 Bullock, Green, and Ha 2010; Green, Ha, and Bullock 2010.
5 Gerber, Green, and Kaplan 2004; Green and Kern 2011; Gerber, Green, Kaplan, and Kern 2010.
CHAPTER 1
Introduction

Daily life continually presents us with questions of cause and effect. Will eating more vegetables make me healthier? If I drive a bit faster than the law allows, will the police pull me over for a speeding ticket? Will dragging my reluctant children to museums make them one day more interested in art and history? Even actions as banal as scheduling a dental exam or choosing an efficient path to work draw on cause-and-effect reasoning.

Organizations, too, grapple with causal puzzles. Charities try to figure out which fundraising appeals work best. Marketing agencies look for ways to boost sales. Churches strive to attract congregants on Sundays. Political parties maneuver to win elections. Interest groups attempt to influence legislation. Whether their aim is to boost donations, sales, attendance, or political influence, organizations make decisions based (at least in part) on their understanding of cause and effect. In some cases, the survival of an organization depends on the skill with which it addresses the causal questions that it confronts.

Of special interest to academic researchers are the causal questions that confront governments and policy makers. What are the economic and social effects of raising the minimum wage? Would allowing parents to pay for private school using publicly funded vouchers make the educational system more effective and cost-efficient? Would legal limits on how much candidates can spend when running for office affect the competitiveness of elections? In the interest of preventing bloodshed, should international peacekeeping troops be deployed with or without heavy weapons? Would mandating harsher punishments for violent offenders deter crime? A list of policy-relevant causal questions would itself fill a book.

An even larger tome would be needed to catalog the many theoretical questions that are inspired by causal claims. For example, when asked to contribute to a collective cause, such as cutting down on carbon emissions in order to prevent global climate change, to what extent are people responsive to appeals based on social norms or ideology? Prominent scholars have argued that collective action will founder unless individuals are given some sort of reward for their participation; according to this argument, simply telling people that they ought to contribute to a collective cause will not work.1 If this underlying causal claim is true, the consequences for policymaking are profound: tax credits may work, but declaring a national Climate Change Awareness Day will not.

Whether because of their practical, policy, or theoretical significance, or simply because they transport us to a different time and place, causal claims spark the imagination. How does the pilgrimage to Mecca affect the religious, social, and political attitudes of Muslims?2 Do high school dropout rates in low-income areas improve when children are given monetary rewards for academic performance?3 Are Mexican police more likely to demand bribes from upper- or lower-class drivers who are pulled aside for traffic infractions?4 Does your race affect whether employers call you for a job interview?5 In the context of a civil war, do civilians become more supportive of the government when local economic conditions improve?6 Does artillery bombardment directed against villages suspected of harboring insurgent guerrillas increase or decrease the likelihood of subsequent insurgent attacks from those villages?7

In short, the world is brimming over with causal questions. How might one go about answering them in a convincing manner? What methods for answering causal questions should be viewed with skepticism?
1 Olson 1965.
2 Clingingsmith, Khwaja, and Kremer 2009.
3 Angrist and Lavy 2009; see also Fryer 2010.
4 Fried, Lagunes, and Venkataramani 2010.
5 Bertrand and Mullainathan 2004.
6 Beath, Christia, and Enikolopov 2011.
7 Lyall 2009.

1.1 Drawing Inferences from Intuitions, Anecdotes, and Correlations

One common way of addressing causal questions is to draw on intuition and anecdotes. In the aforementioned case of artillery directed at insurgent villages, a scholar might reason that firing on these villages could galvanize support for the rebels, leading to more insurgent attacks in the future. Bombardment might also prompt the rebels to demonstrate to villagers their determination to fight on by escalating their insurgent activities. In support of this hypothesis, one might point out that the anti-Nazi insurgency in Soviet Russia in 1941 became more determined after occupation forces stepped up their military suppression. One problem with building causal arguments around intuitions and anecdotes, however, is that such arguments can often be adduced for both sides of a causal claim. In the case of firing on insurgents, another researcher could argue that insurgents depend on the goodwill of villagers; once a village is fired upon, villagers have a greater incentive to expel the rebels in order to prevent future attacks. Supplies dry up, and informants disclose rebel hideouts to government forces. This researcher could defend the argument by describing the government suppression of the Sanusi uprising in Libya, which seemed to deal a lasting blow to these rebels’ ability to carry out insurgent attacks.8 Debates based on intuition and anecdotes frequently result in stalemate.

A critique of anecdote and intuition can be taken a step further. The method is susceptible to error even when intuition and anecdotes seem to favor just one side of an argument. The history of medicine, which is instructive because it tends to provide clearer answers to causal questions than research in social science, is replete with examples of well-reasoned hypotheses that later proved to be false when tested experimentally. Consider the case of aortic arrhythmia (irregular heartbeat), which is often associated with heart attacks. A well-regarded theory held that arrhythmia was a precursor to heart attack. Several drugs were developed to suppress arrhythmia, and early clinical reports seemed to suggest the benefits of restoring a regular heartbeat.
The Cardiac Arrhythmia Suppression Trial, a large randomized experiment, was launched in the hope of finding which of three suppression drugs worked best, only to discover that two of the three drugs produced a significant increase in death and heart attacks, while the third had negative but seemingly less fatal consequences.9 The broader point is that well-regarded theories are fallible. This concern is particularly acute in the social sciences, where intuitions are rarely uncontroversial, and controversial intuitions are rarely backed up by conclusive evidence.

8 See Lyall 2009 for a discussion of these debates and historical episodes.
9 Cardiac Arrhythmia Suppression Trial II Investigators 1992.

Another common research strategy is to assemble statistical evidence showing that an outcome becomes more likely when a certain cause is present. Researchers sometimes go to great lengths to assemble large datasets that allow them to track the correlation between putative causes and effects. These data might be used to learn about the following statistical relationship: to what extent do villages that come under attack by government forces tend to have more or less subsequent insurgent activity? Sometimes these analyses turn up robust correlations between interventions and outcomes. The problem is that correlations can be a misleading guide to causation. Suppose, for example, that the correlation between government bombardment and subsequent insurgent activity were found to be strongly positive: the more shelling, the more subsequent insurgent activity. If interpreted causally, this correlation would indicate that shelling prompted insurgents to step up their attacks. Other interpretations, however, are possible. It could be that government forces received intelligence about an escalation of insurgent activity in certain villages and directed their artillery there. Shelling, in other words, could be a marker for an uptick in insurgent activity. Under this scenario, we would observe a positive correlation between shelling and subsequent insurgent attacks even if shelling per se had no effect.

The basic problem with using correlations as a guide to causality is that correlations may arise for reasons that have nothing to do with the causal process under investigation. Do SAT preparation courses improve SAT scores? Suppose there were a strong positive correlation here: people who took a prep class on average got higher SAT scores than those who did not take the prep class. Does this correlation reflect the course-induced improvement in scores, or rather the fact that students with the money and motivation to take a prep course tend to score higher than their less affluent or less motivated counterparts? If the latter were true, we might see a strong association even if the prep course had no effect on scores. A common error is to reason that where there’s smoke, there’s fire: correlations at least hint at the existence of a causal relationship, right? Not necessarily. Basketball players tend to be taller than other people, but you cannot grow taller by joining the basketball team.

The distinction between correlation and causation seems so fundamental that one might wonder why social scientists rely on correlations when making causal arguments. The answer is that the dominant methodological practice is to transform raw correlations into more refined correlations.
After noticing a correlation that might have a causal interpretation, researchers attempt to make this causal interpretation more convincing by limiting the comparison to observations that have similar background attributes. For example, a researcher seeking to isolate the effects of the SAT preparatory course might restrict attention to people with the same gender, age, race, grade point average, and socioeconomic status. The problem is that this method remains vulnerable to unobserved factors that predict SAT scores and are correlated with taking a prep course. By restricting attention to people with the same socio-demographic characteristics, a researcher makes the people who took the course comparable to those who did not in terms of observed attributes, but these groups may nevertheless differ in ways that are unobserved. In some cases, a researcher may fail to consider some of the factors that contribute to SAT scores. In other cases, a researcher may think of relevant factors but fail to measure them adequately. For example, people who take the prep course may, on average, be more motivated to do well on the test. If we fail to measure motivation (or fail to measure it accurately), it will be one of the unmeasured attributes that might cause us to draw mistaken inferences. These unmeasured attributes are sometimes called confounders or lurking variables or unobserved heterogeneity. When interpreting correlations, researchers must always be alert to the distorting influence of unmeasured attributes. The fact that someone chooses to take the prep course may reveal something about how they are likely to perform on the test. Even if the course truly has no effect, people with the same age, gender, and affluence may seem to do better when they take the course.

Whether the problem of unobserved confounders is severe or innocuous will depend on the causal question at hand and the manner in which background attributes are measured. Consider the so-called “broken windows” theory, which suggests that crime increases when blighted areas appear to be abandoned by property owners and unsupervised by police.10 The causal question is whether one could reduce crime in such areas by picking up trash, removing graffiti, and repairing broken windows. A weak study might compare crime rates on streets with varying levels of property disrepair. A more convincing study might compare crime rates on streets that currently experience different levels of blight but in the past had similar rates of disrepair and crime. But even the latter study may still be unconvincing because unmeasured factors, such as the closing of a large local business, may have caused some streets to deteriorate physically and coincided with an upsurge in crime.11

Determined to conquer the problem of unobserved confounders, one could set out to measure each and every one of the unmeasured factors. The intrepid researcher who embarks on this daunting task confronts a fundamental problem: no one can be sure what the set of unmeasured factors comprises. The list of all potential confounders is essentially a bottomless pit, and the search has no well-defined stopping rule. In the social sciences, research literatures routinely become mired in disputes about unobserved confounders and what to do about them.
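The SAT example can be made concrete with a small simulation. The sketch below is illustrative only: it is written in Python rather than the R used for the book's own materials, and every number in it (the score scale, the enrollment probabilities, the function name `simulate_prep_course`) is a hypothetical assumption rather than anything taken from the text. It builds a world in which the prep course has no effect at all, yet an unmeasured trait (motivation) raises both the chance of enrolling and the test score, so a naive comparison of takers and non-takers still shows a large gap.

```python
import random
import statistics


def simulate_prep_course(n=20000, true_effect=0.0, seed=1):
    """Return mean score of course-takers minus mean score of non-takers.

    All parameters are hypothetical. Motivation is the unobserved
    confounder: it raises scores and makes enrollment more likely.
    """
    rng = random.Random(seed)
    takers, non_takers = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # never recorded by the researcher
        # Motivated students enroll with probability 0.8, others with 0.2.
        took_course = rng.random() < (0.8 if motivation > 0 else 0.2)
        score = (1000 + 100 * motivation
                 + true_effect * took_course  # zero by default
                 + rng.gauss(0, 50))
        (takers if took_course else non_takers).append(score)
    return statistics.mean(takers) - statistics.mean(non_takers)


naive_gap = simulate_prep_course()
print(f"Naive difference in mean scores: {naive_gap:.1f} points "
      "(the simulated course effect is exactly 0)")
```

Because motivation is never observed, no amount of matching on recorded attributes would remove the gap in this simulated world; only a design that breaks the link between motivation and enrollment would.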
1.2
Experiments as a Solution to the Problem of Unobserved Confounders
The challenge for those who seek to answer causal questions in a convincing fashion is to come up with a research strategy that does not require them to identify, let alone measure, all potential confounders. Gradually, over the course of centuries, researchers developed procedures designed to sever the statistical relationship between the treatment and all variables that predict outcomes. The earliest experiments, such as Lind’s study of scurvy in the 1750s and Watson’s study of smallpox in the 1760s, introduced the method of systematically tracking the effects of a researcher-induced intervention by comparing outcomes in the treatment group to outcomes in one or more control groups.12 One important limitation of these early studies is that they assumed that their subjects were identical in terms of their medical trajectories. What if this assumption were false, and treatments tended to be administered to patients with the best chances of recovery? Concerned that the apparent effects of their intervention might be attributable to extraneous factors, researchers placed increasing emphasis on the procedure by which treatments were assigned to subjects. Many pathbreaking studies of the nineteenth century assigned subjects alternately to treatment and control in an effort to make the experimental groups comparable. In 1809, a Scottish medical student described research conducted in Portugal in which army surgeons treated 366 sick soldiers alternately with bloodletting and other palliatives.13 In the 1880s, Louis Pasteur tested his anthrax vaccine on animals by alternately selecting animals for treatment and exposing them to the bacteria. In 1898, Johannes Fibiger assigned an experimental treatment to diphtheria patients admitted to a hospital in Copenhagen on alternate days.14 Alternating designs were common in early agricultural studies and investigations of clairvoyance, although researchers gradually came to recognize potential pitfalls of alternation.15 One problem with alternating designs is that they cannot definitively rule out confounding factors, such as sicker diphtheria patients coming to the hospital on certain days of the week. The first to recognize the full significance of this point was the agricultural statistician R. A. Fisher, who in the mid-1920s argued vigorously for the advantages of assigning observations at random to treatment and control conditions.16
10 Wilson and Kelling 1982.
11 See Keizer, Lindenberg, and Steg 2008, but note that this study does not employ random assignment. For a randomized field experiment see Mazerolle, Price, and Roehl 2000.
12 Hughes 1975; Boylston 2008.
13 Chalmers 2001.
14 Hróbjartsson, Gøtzsche, and Gluud 1998.
15 Merrill 2010. For further reading on the history of experimentation, see Cochran 1976; Forsetlund, Chalmers, and Bjørndal 2007; Hacking 1990; and Salsburg 2001. See Greenberg and Shroder 2004 on social experiments and Green and Gerber 2003 on the history of experiments in political science.
16 Box 1980, p. 3.
This insight represents a watershed moment in the history of science. Recognizing that no planned design, no matter how elaborate, could fend off every possible systematic difference between the treatment and control groups, Fisher laid out a general procedure for eliminating systematic differences between treatment and control groups: random assignment. When we speak of experiments in this volume, we refer to studies in which some kind of random procedure, such as a coin flip, determines whether a subject receives a treatment.
One remarkable aspect of the history of randomized experimentation is that the idea of random assignment occurred to several ingenious people centuries before it was introduced into modern scientific practice. For example, the notion that one could use random assignment to form comparable experimental groups seems to have been apparent to the Flemish physician Jan Baptist Van Helmont, whose 1648 manuscript “Origin of Medicine” challenged the proponents of bloodletting to perform the following randomized experiment:
Let us take out of the hospitals . . . 200 or 500 poor people, that have fevers, pleurisies. Let us divide them into halves, let us cast lots, that one halfe of them may fall to my share, and the other to yours; I will cure them without bloodletting and sensible evacuation; but you do, as ye know . . . We shall see how many funerals both of us shall have.17
Unfortunately for those whose physicians prescribed bloodletting in the centuries following Van Helmont, he never conducted his proposed experiment. One can find similar references to hypothetical experiments dating back to medieval times, but no indication that any were actually put into practice. Until the advent of modern statistical theory in the early twentieth century, the properties of random assignment were not fully appreciated, nor were they discussed in a systematic manner that would have allowed one generation to recommend the idea to the next. Even after Fisher’s ideas became widely known in the wake of his 1935 book The Design of Experiments, randomized designs met resistance from medical researchers until the 1950s, and randomized experiments did not catch on in the social sciences until the 1960s.18 In the class of brilliant twentieth-century discoveries, the idea of randomization contrasts sharply with the idea of relativity, which lay completely hidden until uncovered by genius. Randomization was more akin to crude oil, something that periodically bubbled to the surface but remained untapped for centuries until its extraordinary practical value came to be appreciated.
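The contrast drawn in this section between alternation and random assignment can be sketched in a few lines of Python. The function names below are hypothetical and introduced only for illustration. Alternation is a fixed function of the order in which subjects arrive, so anything correlated with arrival order can be correlated with treatment; a coin flip or the casting of lots depends on nothing about the subject.

```python
import random

def alternating_assignment(subjects):
    """Assign subjects alternately in order of arrival (deterministic)."""
    return {s: ("treatment" if i % 2 == 0 else "control")
            for i, s in enumerate(subjects)}

def coin_flip_assignment(subjects, rng):
    """Flip a fair coin independently for each subject."""
    return {s: ("treatment" if rng.random() < 0.5 else "control")
            for s in subjects}

def cast_lots(subjects, rng):
    """Divide subjects into halves by lot, in the spirit of Van Helmont."""
    pool = list(subjects)
    rng.shuffle(pool)                      # random order, then split in two
    treated = set(pool[: len(pool) // 2])
    return {s: ("treatment" if s in treated else "control") for s in subjects}

rng = random.Random(1648)                  # seed fixed only for reproducibility
patients = [f"patient_{i}" for i in range(200)]
lots = cast_lots(patients, rng)
n_treated = sum(1 for arm in lots.values() if arm == "treatment")
print(n_treated, "of", len(patients), "patients assigned to treatment by lot")
```

The coin flip version leaves the size of each group to chance, whereas casting lots over a shuffled list fixes the group sizes in advance. Both are random procedures in the sense used in this volume; the alternating version is not.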
1.3
Experiments as Fair Tests
In the contentious world of causal claims, randomized experimentation represents an evenhanded method for assessing what works. The procedure of assigning treatments at random ensures that there is no systematic tendency for either the treatment or control group to have an advantage. If subjects were assigned to treatment and control groups and no treatment were actually administered, there would be no reason to expect that one group would outperform the other. In other words, random assignment implies that the observed and unobserved factors that affect outcomes are equally likely to be present in the treatment and control groups. Any given experiment may overestimate or underestimate the effect of the treatment, but if the experiment were conducted repeatedly under similar conditions, the average experimental result would accurately reflect the true treatment effect. In Chapter 2, we will spell out this feature of randomized experiments in greater detail when we discuss the concept of unbiased estimation.
17 Chalmers 2001, p. 1157.
18 The advent of randomized experimentation in social and medical research took roughly a quarter century. Shortly after laying the statistical foundations for random assignment and the analysis of experimental data, Fisher collaborated on the first randomized agricultural experiment (Eden and Fisher 1927). Within a few years, Amberson, McMahon, and Pinner (1931) performed what appears to be the first randomized medical experiment, in which tuberculosis patients were assigned to clinical trials based on a coin flip. The large-scale studies of tuberculosis conducted during the 1940s brought randomized clinical trials to the forefront of medicine. Shortly afterward, the primacy of this methodology in medicine was cemented by a series of essays by Hill (1951, 1952) and subsequent acclaim of the polio vaccine trials of the 1950s (Tanur 1989). Randomized clinical trials gradually came to be heralded as the gold standard by which medical claims were to be judged. By 1952, books such as Kempthorne’s Design and Analysis of Experiments (pp. 125-126) declared that “only when the treatments in the experiment are applied by the experimenter using the full randomization procedure is the chain of inductive inference sound.”
Experiments are fair in another sense: they involve transparent, reproducible procedures. The steps used to conduct a randomized experiment may be carried out by any research group. A random procedure such as a coin flip may be used to allocate observations to treatment or control, and observers can monitor the random assignment process to make sure that it is followed faithfully. Because the allocation process precedes the measurement of outcomes, it is also possible to spell out beforehand the way in which the data will be analyzed. By automating the process of data analysis, one limits the role of discretion that could compromise the fairness of a test.
Random allocation is the dividing line that separates experimental from nonexperimental research in the social sciences. When working with nonexperimental data, one cannot be sure whether the treatment and control groups are comparable because no one knows precisely why some subjects and not others came to receive the treatment. A researcher may be prepared to assume that the two groups are comparable, but assumptions that seem plausible to one researcher may strike another as far-fetched.
This is not to say that experiments are free from problems. Indeed, this book would be rather brief were it not for the many complications that may arise in the course of conducting, analyzing, and interpreting experiments. Entire chapters are devoted to problems of noncompliance (subjects who receive a treatment other than the one to which they were randomly assigned), attrition (observations for which outcome measurements are unavailable), and interference between units (observations influenced by the experimental conditions to which other observations are assigned). The threat of bias remains a constant concern even when conducting experiments, which is why it is so important to design and analyze them with an eye toward maintaining symmetry between treatment and control groups and, more generally, to embed the experimental enterprise in institutions that facilitate proper reporting and accumulation of experimental results.
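The claim made earlier in this section, that any single experiment may overestimate or underestimate the treatment effect while the average over repeated experiments recovers it, can be checked with a small simulation. This is a minimal sketch under stated assumptions: the outcome lists `y0` and `y1`, the constant effect of 5, and the helper `difference_in_means` are all hypothetical choices made for the example, and the estimator is a simple comparison of group averages.

```python
import random
import statistics

def difference_in_means(y0, y1, treated):
    """Estimate the effect from one experiment: treated mean minus control mean."""
    treat = [y1[i] for i in range(len(y0)) if i in treated]
    ctrl = [y0[i] for i in range(len(y0)) if i not in treated]
    return statistics.mean(treat) - statistics.mean(ctrl)

rng = random.Random(42)
n = 100
true_effect = 5.0
# Assumed outcomes for each subject without and with treatment.
y0 = [rng.gauss(50, 10) for _ in range(n)]
y1 = [y + true_effect for y in y0]

estimates = []
for _ in range(2000):
    # Re-randomize: half of the subjects are treated, chosen at random.
    treated = set(rng.sample(range(n), n // 2))
    estimates.append(difference_in_means(y0, y1, treated))

mean_estimate = statistics.mean(estimates)
print(f"smallest estimate: {min(estimates):.2f}")
print(f"largest estimate:  {max(estimates):.2f}")
print(f"average estimate:  {mean_estimate:.2f} (true effect {true_effect})")
```

Individual randomizations scatter on both sides of the true effect, but their average lands close to it. Replacing the random draw of `treated` with a rule that favors subjects with high `y0` would break this property, which is the sense in which random assignment makes the test fair.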
1.4
Field Experiments
Experiments are used for a wide array of different purposes. Sometimes the aim of an experiment is to assess a theoretical claim by testing an implied causal relationship. Game theorists, for example, use laboratory experiments to show how the introduction of uncertainty or the opportunity to exchange information prior to negotiating affects the bargains that participants strike with one another.19 Such experiments are often couched in very abstract terms, with rules that stylize the features of an auction, legislative session, or international dispute. The participants are typically ordinary people (often members of the university community), not traders, legislators, or diplomats, and the laboratory environment makes them keenly aware that they are participating in a research study.
19 See Davis and Holt 1993; Kagel and Roth 1995; Guala 2005.
BOX 1.1 Experiments in the Natural Sciences
Readers with a background in the natural sciences may find it surprising that random assignment is an integral part of the definition of a social science experiment. Why is random assignment often unnecessary in experiments in, for example, physics? Part of the answer is that the “subjects” in these experiments (e.g., electrons) are more or less interchangeable, and so the method used to assign subjects to treatment is inconsequential. Another part of the answer is that lab conditions neutralize all forces other than the treatment.
In the life sciences, subjects are often different from one another, and eliminating unmeasured disturbances can be difficult even under carefully controlled conditions. An instructive example may be found in a study by Crabbe, Wahlsten, and Dudek (1999), who performed a series of experiments on mouse behavior in three different science labs. As Lehrer (2010) explains:
Before [Crabbe] conducted the experiments, he tried to standardize every variable he could think of. The same strains of mice were used in each lab, shipped on the same day from the same supplier. The animals were raised in the same kind of enclosure, with the same brand of sawdust bedding. They had been exposed to the same amount of incandescent light, were living with the same number of littermates, and were fed the exact same type of chow pellets. When the mice were handled, it was with the same kind of surgical glove, and when they were tested it was on the same equipment, at the same time in the morning.
Nevertheless, experimental interventions produced markedly different results across mice and research sites.
At the other end of the spectrum are experiments that strive to be as realistic and unobtrusive as possible in an effort to test more context-specific hypotheses. Quite often this type of research is inspired by a mixture of theoretical and practical concerns. For example, to what extent and under what conditions does preschool improve subsequent educational outcomes? Experiments that address this question shed light on theories about childhood development while at the same time informing policy debates about whether and how to allocate resources to early childhood education in specific communities.
The push for realism and unobtrusiveness stems from the concern that unless one conducts experiments in a naturalistic setting and manner, some aspect of the experimental design may generate results that are idiosyncratic or misleading. If subjects know that they are being studied or if they sense that the treatment they received is supposed to elicit a certain kind of response, they may express the opinions or report the behavior they believe the experimenter wants to hear. A treatment may seem effective until a more unobtrusive experiment proves otherwise.20 Conducting research in naturalistic settings may be viewed as a hedge against unforeseen threats to inference that arise when drawing generalizations from results obtained in laboratory settings. Just as experiments are designed to test causal claims with minimal reliance on assumptions, experiments conducted in real-world settings are designed to make generalizations less dependent on assumptions.
Randomized studies that are conducted in real-world settings are often called field experiments, a term that calls to mind early agricultural experiments that were literally conducted in fields. The problem with the term is that the word field refers to the setting, but the setting is just one aspect of an experiment.
One should invoke not one but several criteria: whether the treatment used in the study resembles the intervention of interest in the world, whether the participants resemble the actors who ordinarily encounter these interventions, whether the context within which subjects receive the treatment resembles the context of interest, and whether the outcome measures resemble the actual outcomes of theoretical or practical interest.
20 Whether this concern is justified is an empirical question, and the answer may well depend on the setting, context, and subjects. Unfortunately, the research literature on this topic remains underdeveloped. Few studies have attempted to estimate treatment effects in both lab and field contexts. Gneezy, Haruvy, and Yafe (2004), for example, use field and lab studies to test the hypothesis that the quantity of food consumed depends on whether each diner pays for his or her own food or whether they all split the bill. When this experiment is conducted in an actual cafeteria, splitting the bill leads to significantly more food consumption; when the equivalent game is played in abstract form (with monetary payoffs) in a nearby lab, the average effect is weak and not statistically distinguishable from zero. Jerit, Barabas, and Clifford (2012) compare the effects of exposure to a local newspaper on political knowledge and opinions. In the field, free Sunday newspapers were randomly distributed to households over the course of one month; in the lab, subjects from the same population were invited to a university setting, where they were presented with the four most prominent political news stories airing during the same month. For the 17 outcome measures, estimated treatment effects in the lab and field are found to be weakly correlated (Table 2). See also Rondeau and List (2008), who compare the effectiveness of different fundraising appeals on behalf of the Sierra Club directed at 3,000 past donors, as measured by actual donations. The fundraising appeals, which involve various combinations of matching funds, thresholds, and money-back guarantees, are then presented in abstract form in a lab setting with monetary payoffs. The correspondence between lab and field results was relatively weak, with average contributions in the lab predicting about 5% of the variance in average contributions in the field across the four conditions.
For example, suppose one were interested in the extent to which financial contributions to incumbent legislators’ reelection campaigns buy donors access to the legislators, a topic of great interest to those concerned that the access accorded to wealthy donors undermines democratic representation. The hypothesis is that the more a donor contributes, the more likely the legislator is to grant a meeting to discuss the donor’s policy prescriptions. One possible design is to recruit students to play the part of legislative schedulers and present them with a list of requests for meetings from an assortment of constituents and donors in order to test whether people described as potential donors receive priority. Another design involves the same exercise, but this time the subjects are actual legislative schedulers.21 The latter design would seem to provide more convincing evidence about the relationship between donations and access in actual legislative settings, but the degree of experimental realism remains ambiguous. The treatments in this case are realistic in the sense that they resemble what an actual scheduler might confront, but the subjects are aware that they are participating in a simulation exercise. Under scrutiny by researchers, legislative schedulers might try to appear indifferent to fundraising considerations; in an actual legislative setting where principals provide feedback to schedulers, donors might receive special consideration. More realistic, then, would be an experiment in which one or more donors contribute randomly assigned sums of money to various legislators and request meetings to discuss a policy or administrative concern.
In this design, the subjects are actual schedulers, the treatment is a campaign donation, the treatment and request for a meeting are authentic, and the outcome is whether a real request is granted in a timely fashion.
21 See Chin, Bond, and Geva 2000.
Because the degree of “fieldness” may be gauged along four different dimensions (authenticity of treatments, participants, contexts, and outcome measures), a proper classification scheme would involve at least sixteen categories (two possibilities on each of the four dimensions), a taxonomy that far exceeds anyone’s interest or patience. Suffice it to say that field experiments take many forms.
Some experiments seem naturalistic on all dimensions. Sherman et al. worked with the Kansas City police department in order to test the effectiveness of police raids on locations where drug dealing was suspected.22 The treatments were raids by teams of uniformed police directed at 104 randomly chosen sites among the 207 locations for which warrants had been issued. Outcomes were crime rates in nearby areas. Karlan and List collaborated with a charity in order to test the effectiveness of alternative fundraising appeals.23 The treatments were fundraising letters; the experiment was unobtrusive in the sense that recipients of the fundraising appeals were unaware that an experiment was being conducted; and the outcomes were financial donations. Bergan teamed up with a grassroots lobbying organization in order to test whether constituents’ e-mail to state representatives influences roll call voting.24 The lobbying organization allowed Bergan to extract a random control group from its list of targeted legislators; otherwise, its lobbying campaign was conducted in the usual way, and outcomes were assessed based on the legislators’ floor votes.
22 Sherman et al. 1995.
23 Karlan and List 2007.
Many field experiments are less naturalistic, and generalizations drawn from them are more dependent on assumptions. Sometimes the interventions deployed in the field are designed by researchers rather than practitioners. Eldersveld, for example, fashioned his own get-out-the-vote campaigns in order to test whether mobilization activities cause registered voters to cast ballots.25 Much may be learned when researchers craft their own treatments; indeed, the development of theoretically inspired interventions is an important way in which researchers may contribute to theoretical and policy debates. However, if the aim of an experiment is to gauge the effectiveness of typical candidate- or party-led voter mobilization campaigns, researcher-led campaigns may be unrepresentative in terms of the messages used or the manner in which they are communicated. Suppose that the researcher’s intervention were to prove ineffective.
This finding alone would not establish that a typical campaign’s interventions are ineffective, although this interpretation could be bolstered by a series of follow-up experiments that test different types of campaign communication.26 Sometimes treatments are administered and outcomes are measured in a way that notifies participants that they are being studied, as in Paluck’s experimental investigation of intergroup prejudice in Rwanda.27 Her study enlisted groups of Rwandan villagers to listen to recordings of radio programs on a monthly basis for a period of one year, at which point outcomes were measured using surveys and role-playing exercises. Finally, experimental studies with relatively little field content are those in which actual interventions are delivered in artificial settings to subjects who are aware that they are part of a study. Examples of this type of research may be found in the domain of commercial advertising, where subjects are shown different types of ads either in the context of an Internet survey or in a lab located in a shopping center.28
24 Bergan 2009.
25 Eldersveld 1956.
26 For example, in an effort to test whether voter mobilization phone calls conducted by call centers are typically ineffective, Panagopoulos (2009) compares partisan and nonpartisan scripts, Nickerson (2007) assesses whether effectiveness varies depending on the quality of the calling center, and other scholars have conducted studies in various electoral environments. See Green and Gerber 2008 for a review of this literature.
27 Paluck 2009.
28 See, for example, Clinton and Lapinski 2004; Kohn, Smart, and Ogborne 1984.
Whether a given study is regarded as a field experiment is partly a matter of perspective. Ordinarily, experiments that take place on college campuses are considered lab studies, but some experiments on cheating involve realistic opportunities for students to copy answers or misreport their own performance on self-graded tests.29 An experimental study that examines the deterrent effect of exam proctoring would amount to a field experiment if one’s aim were to understand the conditions under which students cheat in school. This example serves as a reminder that what constitutes a field experiment depends on how “the field” is defined.
1.5
Advantages and Disadvantages of Experimenting in Real-World Settings
Many field experiments take the form of “program evaluations” designed to gauge the extent to which resources are deployed effectively. For example, in order to test whether a political candidate’s TV advertising campaign increases her popularity, a field experiment might randomize the geographic areas in which the ads are deployed and measure differences in voter support between treatment and control regions. From the standpoint of program evaluation, this type of experiment is arguably superior to a laboratory study in which voters are randomly shown the candidate’s ads and later asked their views about the candidate. The field experiment tests the effects of deploying the ads and allows for the possibility that some voters in targeted areas will miss the ad, watch it inattentively, or forget its message amid life’s other distractions. Interpretation of the lab experiment’s results is complicated by the fact that subjects in lab settings may respond differently to the ads than the average voter outside the lab. In this application, preliminary lab research might be useful insofar as it suggests which messages are most likely to work in field settings, but only a field experiment allows the researcher to reliably gauge the extent to which an actual ad campaign changed votes and to express this outcome in relation to the resources spent on the campaign.
As we move from program evaluation to tests of theoretical propositions, the relative merits of field and lab settings become less clear-cut. A practical advantage of delivering treatments under controlled laboratory conditions is that one can more easily administer multiple variations of a treatment to test fine-grained theoretical propositions.
Field interventions are often more cumbersome: in the case of political advertisements, it may be logistically challenging or politically risky to air multiple advertisements in different media markets. On the other hand, field experiments are sometimes able to achieve a high level of theoretical nuance when a wide array of treatments can be distributed across a large pool of subjects. Field experiments that deploy multiple versions of a treatment are common, for example, in research on discrimination, where researchers vary ethnicity, social class, and a host of other characteristics to better understand the conditions under which discrimination occurs.30
29 Canning 1956; Nowell and Laufer 1997.
Even when limited to a single, relatively blunt intervention, a researcher may still have reason to conduct experiments in the field. Advertising research in field settings is often unobtrusive in the sense that subjects are not viewing the ad at the behest of a researcher, and outcomes are measured in a way that does not alert subjects to the fact that they are being studied.31 Whereas outcomes in lab settings are often attitudes and behaviors that can be measured in the space of one sitting,32 field studies tend to monitor behaviors over extended periods of time. The importance of ongoing outcome measurement is illustrated by experiments that find strong instantaneous effects of political advertising that decay rapidly over time.33
Perhaps the biggest disadvantage of conducting experiments in the field is that they are often challenging to implement. In contrast to the lab, where researchers can make unilateral decisions about what treatments to deploy, field experiments are often the product of coordination between researchers and those who actually carry out the interventions or furnish data on subjects’ outcomes. Orr34 and Gueron35 offer helpful descriptions of how these partnerships are formed and nurtured over the course of a collaborative research project. Both authors stress the importance of building consensus about the use of random assignment. Research partners and funders sometimes balk at the idea of randomly allocating treatments, preferring instead to treat everyone or a hand-picked selection of subjects. The researcher must be prepared to formulate a palatable experimental design and to argue convincingly that the proposed use of random assignment is both feasible and ethical.
The authors also stress that successful implementation of the agreed-upon experimental design (the allocation of subjects, the administration of treatments, and the measurement of outcomes) requires planning, pilot testing, and constant supervision.
30 See Doleac and Stein 2010 for a study of racial discrimination by bidders on Internet auctions or Pager, Western, and Bonikowski 2009 for a study of labor market discrimination. We discuss discrimination experiments in Chapters 9 and 12.
31 In cases where surveys are used to assess outcomes, measurement may be unobtrusive in the more limited but nevertheless important sense that subjects are unaware that the survey aims to gauge the effects of the intervention.
32 Orchestrating return visits to the lab often presents logistical challenges, and failure to attract all subjects back to the lab may introduce bias (see Chapter 7).
33 See, for example, Gerber, Gimpel, Green, and Shaw 2011. See also the discussion of outcome measurement in Chapter 12.
34 Orr 1999, Chapter 5.
35 Gueron 2002.
Managing research collaboration with schools, police departments, retail firms, or political campaigns sounds difficult and often is. Nevertheless, field experimentation is a rapidly growing form of social science research, encompassing hundreds of studies on topics like education, crime, employment, savings, discrimination, charitable giving, conservation, and political participation.36 The set of noteworthy and influential studies includes experiments of every possible description: small-scale interventions designed and implemented by researchers; collaborations between researchers and firms, schools, police agencies, or political campaigns; and massive government-funded studies of income taxes, health insurance, schooling, and public housing.37 Time and again, researchers overcome practical hurdles, and the boundaries of what is possible seem to be continually expanding. Consider, for example, research on how to promote government accountability. Until the 1990s, research in this domain was almost exclusively nonexperimental, but a series of pathbreaking studies have shown that one can use experiments to investigate the effects of government audits and community forums on accounting irregularities among public works programs,38 the effects of grassroots monitoring efforts on the performance of legislators,39 and the effects of information about constituents’ preferences on legislators’ roll call votes.40 Field experiments are sometimes faulted for their inability to address big questions, such as the effects of culture, wars, or constitutions, but researchers have grown increasingly adept at designing experiments that test the effects of mechanisms that are thought to transmit the effects of the hard-to-manipulate variables.41 Given the rapid pace of innovation, the potential for experimental inquiry remains an open question.
1.6
Naturally Occurring Experiments and Quasi-Experiments
Another way to expand the domain of what may be studied experimentally is to seize on naturally occurring experiments. Experimental research opportunities arise when interventions are assigned by a government or institution.42 For example, the Vietnam draft lottery,43 the random assignment of defendants to judges,44 the random audit of local municipalities in Brazil,45 lotteries that assign parents the opportunity to place their children in different public schools,46 the assignment of Indian local governments to be headed by women or members of scheduled castes,47 the allocation of visas to those seeking to immigrate,48 and legislative lotteries to determine which representative will be allowed to propose legislation49 are a few examples where randomization procedures have been employed by government, setting the stage for an experimental analysis. Researchers have also seized on natural experiments conducted by nongovernmental institutions. Universities, for example, occasionally randomize the pairing of roommates, allocation of instructors, and composition of tenure review committees.50 Sports of all kinds use coin flips and lotteries to assign everything from the sequence of play to the colors worn by the contestants.51 This list of naturally occurring experimental opportunities might also include revisiting random allocations conducted for other research purposes. A downstream experiment refers to a study whose intervention affects not only the proximal outcome of interest but, in so doing, potentially influences other outcomes as well (see Chapter 6). For example, a researcher might revisit an experiment that induced an increase in high school graduation rates in order to assess whether this randomly induced change in educational attainment in turn caused an increase in voter turnout.52 In this book, we scarcely distinguish between field experiments and naturally occurring experiments, except to note that extra effort is sometimes required in order to verify that draft boards, court systems, or school districts implemented random assignment.
36 Michalopoulos 2005; Green and Gerber 2008.
37 See, e.g., Robins 1985 on income taxes; Newhouse 1989 on health insurance; Krueger and Whitmore 2001 and U.S. Department of Health and Human Services 2010 on schooling. On public housing, see Sanbonmatsu et al. 2006; Harcourt and Ludwig 2006; and Kling, Liebman, and Katz 2007.
38 Olken 2007.
39 Humphreys and Weinstein 2010; Grose 2009.
40 Butler and Nickerson 2011.
41 Ludwig, Kling, and Mullainathan 2011; Card, DellaVigna, and Malmendier 2011.
42 Unfortunately, the term “natural experiment” is sometimes used quite loosely, encompassing not only naturally occurring randomized experiments but also any observational study in which the method of assignment is haphazard or inscrutable. We categorize studies that use near-random or arguably random assignment as quasi-experiments. For definitions of the term natural experiment that do not require random assignment, see Dunning 2012 and Shadish, Cook, and Campbell 2002, p. 17.
Quite different are quasi-experiments, in which near-random processes cause places, groups, or individuals to receive different treatments. Since the mid-1990s, a growing number of scholars have studied instances where institutional rules cause near-random treatment assignments to be allocated among those who fall just short of or just beyond a cutoff, creating a discontinuity. One of the most famous examples of this research design is a study of U.S. congressional districts in which one party’s candidate narrowly wins a plurality of votes.53 The small shift in votes that separates a narrow victory from a narrow defeat produces a treatment, winning the seat in the House of Representatives, that might be construed as random. One could compare near-winners to near-losers in order to assess the effect of a narrow victory on the probability that the winning party wins reelection in the district two years later.
Because quasi-experiments do not involve an explicit random assignment procedure, the causal inferences they support are subject to greater uncertainty. Although the researcher may have good reason to believe that observations on opposite sides of an arbitrary threshold are comparable, there is always some risk that the observations may have “sorted” themselves so as to receive or avoid the treatment. Critics who have looked closely at the pool of congressional candidates who narrowly win or lose have pointed out that there appear to be systematic differences between near-winners and near-losers in terms of their political resources.54 The same concerns apply to a wide array of quasi-experiments that take weather patterns, natural disasters, colonial settlement patterns, national boundaries, election cycles, assassinations, and so forth to be near-random “treatments.” In the absence of random assignment, there is always some uncertainty about how nearly random these treatments are. Although these studies are similar in spirit to field experimentation insofar as they strive to illuminate causal effects in real-world settings, they fall outside the scope of this book because they rely on argumentation rather than randomization procedures. In order to present a single, coherent perspective on experimental design and analysis, this book confines its attention to randomized experiments.
43 Angrist 1991.
44 Kling 2006; Green and Winik 2010.
45 Ferraz and Finan 2008.
46 Hastings, Kane, Staiger, and Weinstein 2007.
47 Beaman et al. 2009; Chattopadhyay and Duflo 2004.
48 Gibson, McKenzie, and Stillman 2011.
49 Loewen, Koop, Settle, and Fowler 2010.
50 Sacerdote 2001; Carrell and West 2010; De Paola 2009; Zinovyeva and Bagues 2010.
51 Hill and Barton 2005; see also Rowe, Harris, and Roberts 2005 for a response to Hill and Barton.
52 Sondheimer and Green 2009.
53 Lee 2008.
1.7
Plan of the Book
This chapter has introduced a variety of important concepts without pausing for rigorous definitions or proofs. Chapter 2 delves more deeply into the properties of experiments, describing in detail the underlying assumptions that must be met for experiments to be informative. Chapter 3 introduces the concept of sampling variability, the statistical uncertainty introduced whenever subjects are randomly allocated to treatment and control groups. Chapter 4 focuses on how covariates, variables that are measured prior to the administration of the treatment, may be used in experimental design and analysis. Chapters 5 and 6 discuss the complications that arise when subjects are assigned one treatment but receive another. The so-called noncompliance or failure-to-treat problem is sufficiently common and conceptually challenging to warrant two chapters. Chapter 7 addresses the problem of attrition, or the failure to obtain outcome measurements for every subject. Because field experiments are frequently conducted in settings where subjects communicate, compare, or remember treatments, Chapter 8 considers the complications associated with interference between experimental units. Because researchers are often interested in learning about the conditions under which treatment effects are especially strong or weak, Chapter 9 discusses the detection of heterogeneous treatment effects. Chapter 10 considers the challenge of studying the causal pathways by which an experimental effect is transmitted. Chapter 11 discusses how one might draw generalizations that go beyond the average treatment effect observed in a particular sample and apply them to the average treatment effect in a broader population. The chapter provides a brief introduction to meta-analysis, a statistical technique that pools data from multiple experiments in order to summarize the findings of a research literature. Chapter 12 discusses a series of noteworthy experiments in order to highlight important principles introduced in previous chapters. Chapter 13 guides the reader through the composition of an experimental research report, providing a checklist of key aspects of any experiment that must be described in detail. Appendix A discusses regulations that apply to research involving human subjects. In order to encourage you to put the book’s ideas to work, Appendix B suggests several experimental projects that involve low cost and minimal risk to human subjects.
54 Grimmer et al. 2011; Caughey and Sekhon 2011. In addition, regression discontinuity analyses often confront the following conundrum: the causal effect of the treatment is identified at the point of discontinuity, but data are sparse in the close vicinity of the boundary. One may expand the comparison to include observations farther from the boundary, but doing so jeopardizes the comparability of groups that do or do not receive the treatment. In an effort to correct for unmeasured differences between the groups, researchers typically use regression to control for trends on either side of the boundary, a method that introduces a variety of modeling decisions and attendant uncertainty. See Imbens and Lemieux 2008 and Green et al. 2009.
SUGGESTED READINGS
Accessible introductions to experimental design in real-world settings can be found in Shadish, Cook, and Campbell 2002 and Torgerson and Torgerson 2008. For a discussion of the limitations of field experimentation, see Heckman and Smith 1995. Morgan and Winship (2007), Angrist and Pischke (2009), and Rosenbaum (2010) discuss the challenges of extracting causal inferences from nonexperimental data. Imbens and Lemieux (2008) provide a useful introduction to regression-discontinuity designs.
EXERCISES: CHAPTER 1
1. Core concepts:
(a) What is an experiment, and how does it differ from an observational study?
(b) What is “unobserved heterogeneity,” and what are its consequences for the interpretation of correlations?
2. Would you classify the study described in the following abstract as a field experiment, a naturally occurring experiment, a quasi-experiment, or none of the above? Why?
“This study seeks to estimate the health effects of sanitary drinking water among low-income villages in Guatemala. A random sample of all villages with fewer than 2,000 inhabitants was selected for analysis. Of the 250 villages sampled, 110 were found to have unsanitary drinking water. In these 110 villages, infant mortality rates were, on average, 25 deaths per 1,000 live births, as compared to 5 deaths per 1,000 live births in the 140 villages with sanitary drinking water. Unsanitary drinking water appears to be a major contributor to infant mortality.”
3. Based on what you are able to infer from the following abstract, to what extent does the study described seem to fulfill the criteria for a field experiment?
“We study the demand for household water connections in urban Morocco, and the effect of such connections on household welfare. In the northern city of Tangiers, among homeowners without a private connection to the city’s water grid, a random subset was offered a simplified procedure to purchase a household connection on credit (at a zero percent interest rate). Take-up was high, at 69%. Because all households in our sample had access to the water grid through free public taps ... household connections did not lead to any improvement in the quality of the water households consumed; and despite a significant increase in the quantity of water consumed, we find no change in the incidence of waterborne illnesses. Nevertheless, we find that households are willing to pay a substantial amount of money to have a private tap at home. Being connected generates important time gains, which are used for leisure and social activities, rather than productive activities.”55
4. A parody appearing in the British Medical Journal questioned whether parachutes are in fact effective in preventing death when skydivers are presented with severe “gravitational challenge.”56 The authors point out that no randomized trials have assigned parachutes to skydivers. Why is it reasonable to believe that parachutes are effective even in the absence of randomized experiments that establish their efficacy?
55 Devoto et al. 2011. 56 Smith and Pell 2003.
CHAPTER 2
Causal Inference and Experimentation
Although the logic of experimentation is for the most part intuitive, researchers can run into trouble if they lack a firm grasp of the key assumptions that must be met in order for experiments to provide reliable assessments of cause and effect. This point applies in particular to field experimental researchers, who must frequently make real-time decisions about research design. Failure to understand core statistical principles and their practical implications may cause researchers to squander resources and experimental opportunities. It is wise, therefore, to invest time studying the formal statistical properties of experiments before launching a research project.
This chapter introduces a system of notation that will be used throughout the book. By depicting the outcomes that potentially manifest themselves depending on whether the treatment is administered to each unit, the notation clarifies a number of key concepts, such as the idea of a treatment effect. This notational system is then used to shed light on the conditions under which experiments provide persuasive evidence about cause and effect. The chapter culminates with a list of core assumptions and what they imply for experimental design. The advantage of working methodically from core principles is that a long list of design-related admonitions flows from a relatively compact set of ideas that can be stored in working memory.
2.1
Potential Outcomes
Suppose we seek to gauge the causal effect of a treatment. For concreteness, suppose we wish to study the budgetary consequences of having women, rather than men, head Indian village councils, which govern rural areas in West Bengal and Rajasthan.1
1 See Chattopadhyay and Duflo 2004.
What you will learn from this chapter:
1. The system of notation used to describe potential outcomes.
2. Definitions of core terms: average treatment effect, expectation, random assignment, and unbiasedness.
3. Assumptions that must be met in order for experiments to produce unbiased estimates of the average treatment effect.
Students of legislative politics have argued that women bring different policy priorities to the budgetary process in developing countries, emphasizing health issues such as providing clean drinking water. Leave aside for the time being the question of how this topic might be studied using randomly assigned treatments. For the moment, simply assume that each village either receives the treatment (a woman serves as village council head) or remains untreated (with its village council headed by a man). For each village, we also observe the share of the local council budget that is allocated to providing clean drinking water. To summarize, we observe the treatment (whether the village head is a woman or not) and the outcome (what share of the budget goes to a policy issue of special importance to women).
What we do not observe is how the budget in each village headed by a man would have been allocated if it had been headed by a woman, and vice versa. Although we do not observe these counterfactual outcomes, we can nevertheless imagine them. Taking this mental exercise one step further, we might imagine that each village has two potential outcomes: the budget it would enact if headed by a woman and the budget it would enact if headed by a man. The gender of the village head determines which potential budget we observe. The other budget remains imaginary or counterfactual.
Table 2.1 provides a stylized example of seven villages in order to introduce the notation that we will use throughout the book. The villages constitute the subjects in this experiment. Each subject is identified by a subscript $i$, which ranges from 1 to 7. The third village on the list, for example, would be designated as $i = 3$. The table imagines what would happen under two different scenarios.
Let $Y_i(1)$ be the outcome if village $i$ is exposed to the treatment (a woman as village head), and let $Y_i(0)$ be the outcome if this village is not exposed to the treatment. For example, Village 3 allocates 30% of its budget to water sanitation if headed by a woman but only 20% if headed by a man, so $Y_3(1) = 30\%$ and $Y_3(0) = 20\%$. These are called potential outcomes because they describe what would happen if a treatment were or were not administered. For purposes of this example, we assume that each village has just two potential outcomes, depending on whether it receives the treatment; villages are assumed to be unaffected by the treatments that other villages receive. In section 2.7, we spell out more precisely the assumptions that underlie the model of potential outcomes and discuss complications that arise when subjects are affected by the treatments that other subjects receive.

TABLE 2.1
Illustration of potential outcomes for local budgets when village council heads are women or men. (Entries are shares of local budgets allocated to water sanitation.)

Village i    Y_i(0): Budget share if    Y_i(1): Budget share if    τ_i: Treatment
             village head is male       village head is female     effect
Village 1    10                         15                           5
Village 2    15                         15                           0
Village 3    20                         30                          10
Village 4    20                         15                          -5
Village 5    10                         20                          10
Village 6    15                         15                           0
Village 7    15                         30                          15
Average      15                         20                           5
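The schedule of potential outcomes in Table 2.1 can also be written down as data. The following Python sketch is only an illustration: the variable names (`Y0`, `Y1`, `tau`) are assumptions introduced here, and the numbers are transcribed from the table. It recomputes the treatment-effect column as the difference between the two potential outcomes.

```python
# Schedule of potential outcomes transcribed from Table 2.1
# (budget shares in percent). Variable names are illustrative only.
Y0 = [10, 15, 20, 20, 10, 15, 15]  # Y_i(0): village head is male
Y1 = [15, 15, 30, 15, 20, 15, 30]  # Y_i(1): village head is female

# Unit-level treatment effect for each village: Y_i(1) - Y_i(0)
tau = [y1 - y0 for y0, y1 in zip(Y0, Y1)]

for i, (y0, y1, t) in enumerate(zip(Y0, Y1, tau), start=1):
    print(f"Village {i}: Y(0)={y0:2d}  Y(1)={y1:2d}  effect={t:3d}")
```

Because both columns are visible only in a hypothetical example, a list like `tau` could never be computed from real data; the sketch simply mirrors the thought experiment in the text.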
2.2
Average Treatment Effects
For each village, the causal effect of the treatment ($\tau_i$) is defined as the difference between two potential outcomes:
$$\tau_i = Y_i(1) - Y_i(0). \tag{2.1}$$
In other words, the treatment effect for each village is the difference between two potential states of the world, one in which the village receives the treatment and another in which it does not. For Village 3, this causal effect is 30 − 20 = 10.
The empirical challenge that researchers typically face when observing outcomes is that at any given time one can observe $Y_i(1)$ or $Y_i(0)$ but not both. (Bear in mind that the only reason we are able to see both potential outcomes for each village in Table 2.1 is that this is a hypothetical example!) Building on the notational system introduced above, we define $Y_i$ as the observed outcome in each village and $d_i$ as the observed treatment that is delivered in each village. In this case, $Y_i$ is the observed share of the budget allocated to water sanitation, and $d_i$ equals 1 when a woman is village head and 0 otherwise.
BOX 2.1
Potential Outcomes Notation
In this system of notation, the subscript $i$ refers to subjects 1 through $N$.
The variable $d_i$ indicates whether the $i$th subject is treated: $d_i = 1$ means the $i$th subject receives the treatment, and $d_i = 0$ means the $i$th subject does not receive the treatment. It is assumed that $d_i$ is observed for every subject.
$Y_i(1)$ is the potential outcome if the $i$th subject were treated. $Y_i(0)$ is the potential outcome if the $i$th subject were not treated. In general, potential outcomes may be written $Y_i(d)$, where $d$ indexes the treatment. These potential outcomes are fixed attributes of each subject and represent the outcome that would be observed hypothetically if that subject were treated or untreated.
A schedule of potential outcomes refers to a comprehensive list of potential outcomes for all subjects. The rows of this schedule are indexed by $i$, and the columns are indexed by $d$. For example, in Table 2.1 the $Y_i(0)$ and $Y_i(1)$ potential outcomes for the fifth subject may be found in adjacent columns of the fifth row.
The connection between the observed outcome $Y_i$ and the underlying potential outcomes is given by the equation $Y_i = d_i Y_i(1) + (1 - d_i) Y_i(0)$. This equation indicates that the $Y_i(1)$ are observed for subjects who are treated, and the $Y_i(0)$ are observed for subjects who are not treated. For any given subject, we observe either $Y_i(1)$ or $Y_i(0)$, never both.
It is sometimes useful to refer to potential outcomes for a subset of all subjects. Expressions of the form $Y_i(\cdot) \mid X = x$ denote potential outcomes when the condition $X = x$ holds. For example, $Y_i(0) \mid d_i = 1$ refers to the untreated potential outcome for a subject who actually receives the treatment.
Because we often want to know about the statistical properties of a hypothetical random assignment, we distinguish between $d_i$, the treatment that a given subject receives (a variable that one observes in an actual dataset), and $D_i$, the treatment that could be administered hypothetically. $D_i$ is a random variable, and the $i$th subject might be treated in one hypothetical study and not in another. For example, $Y_i(1) \mid D_i = 1$ refers to the treated potential outcome for a subject who would be treated under some hypothetical allocation of treatments.
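To make the link between potential and observed outcomes concrete, the sketch below applies the equation $Y_i = d_i Y_i(1) + (1 - d_i) Y_i(0)$ to the villages in Table 2.1. The assignment vector `d` is a hypothetical choice made purely for illustration and does not come from the text; the function name `observed_outcomes` is likewise an assumption.

```python
# Potential outcomes transcribed from Table 2.1 (percent of budget).
Y0 = [10, 15, 20, 20, 10, 15, 15]  # Y_i(0)
Y1 = [15, 15, 30, 15, 20, 15, 30]  # Y_i(1)

# Hypothetical treatment assignment (an assumption for illustration):
# villages 1, 4, and 6 are headed by women.
d = [1, 0, 0, 1, 0, 1, 0]

def observed_outcomes(d, Y0, Y1):
    """Return Y_i = d_i * Y_i(1) + (1 - d_i) * Y_i(0) for every subject."""
    return [di * y1 + (1 - di) * y0 for di, y0, y1 in zip(d, Y0, Y1)]

Y = observed_outcomes(d, Y0, Y1)
print(Y)  # one observed outcome per village; the other stays counterfactual
```

For each village exactly one of the two terms survives, so the list `Y` reveals only one potential outcome per subject, which is the core measurement problem described in this section.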
The budget that we observe in each village may be summarized using the following expression:
$$Y_i = d_i Y_i(1) + (1 - d_i) Y_i(0). \tag{2.2}$$
Because $d_i$ is either 0 or 1, one of the terms on the right side of the equals sign will always be zero. We observe the potential outcome that results from treatment, $Y_i(1)$, if the treatment is administered ($d_i = 1$). If the treatment is not administered ($d_i = 0$), we observe the potential outcome that results when no treatment occurs, $Y_i(0)$.
The average treatment effect, or ATE, is defined as the sum of the $\tau_i$ divided by $N$, the number of subjects:
$$\mathrm{ATE} = \frac{1}{N} \sum_{i=1}^{N} \tau_i. \tag{2.3}$$
An equivalent way to obtain the average treatment effect is to subtract the average value of $Y_i(0)$ from the average value of $Y_i(1)$:
$$\frac{1}{N} \sum_{i=1}^{N} Y_i(1) - \frac{1}{N} \sum_{i=1}^{N} Y_i(0) = \frac{1}{N} \sum_{i=1}^{N} \bigl(Y_i(1) - Y_i(0)\bigr) = \frac{1}{N} \sum_{i=1}^{N} \tau_i. \tag{2.4}$$
The average treatment effect is an extremely important concept. Villages may have different $\tau_i$, but the ATE indicates how outcomes would change on average if every village were to go from untreated (male village council head) to treated (female village council head). From the rightmost column of Table 2.1, we can calculate the ATE for the seven villages. The average treatment effect in this example is 5 percentage points: if all villages were headed by men, they would on average spend 15% of their budgets on water sanitation, whereas if all villages were headed by women, this figure would rise to 20%.
Definition: Average Treatment Effect
The average treatment effect (ATE) is the sum of the subject-level treatment effects, $Y_i(1) - Y_i(0)$, divided by the total number of subjects. An equivalent way to express the ATE is to say that it equals $\mu_{Y(1)} - \mu_{Y(0)}$, where $\mu_{Y(1)}$ is the average value of $Y_i(1)$ for all subjects and $\mu_{Y(0)}$ is the average value of $Y_i(0)$ for all subjects.
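The two equivalent routes to the ATE described above can be checked numerically. This is a minimal sketch using the Table 2.1 values; the names `ate_from_effects` and `ate_from_means` are assumptions introduced for illustration, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Potential outcomes transcribed from Table 2.1 (percent of budget).
Y0 = [10, 15, 20, 20, 10, 15, 15]  # Y_i(0)
Y1 = [15, 15, 30, 15, 20, 15, 30]  # Y_i(1)
N = len(Y0)

# Route 1: average the subject-level treatment effects.
ate_from_effects = Fraction(sum(y1 - y0 for y0, y1 in zip(Y0, Y1)), N)

# Route 2: difference between the average Y_i(1) and the average Y_i(0).
ate_from_means = Fraction(sum(Y1), N) - Fraction(sum(Y0), N)

print(ate_from_effects, ate_from_means)
```

Both quantities equal 5 percentage points here, matching the bottom row of Table 2.1 (20 minus 15).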
2.3
Random Sampling and Expectations
Suppose that instead of calculating the average potential outcome for all villages, we drew a random sample of villages and calculated the average among the villages we sampled. By random sample, we mean a selection procedure in which $v$ villages are selected from the list of $N$ villages, and every possible set of $v$ villages is equally likely to be selected. For example, if we select one village at random from a list of seven villages, seven possible samples are equally likely. If we select three villages at random from a list of seven villages,
$$\frac{N!}{v!\,(N-v)!} = \frac{7!}{3!\,4!} = \frac{7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1}{(3 \times 2 \times 1)(4 \times 3 \times 2 \times 1)} = 35 \tag{2.5}$$
possible samples are equally likely. If potential outcomes vary from one village to the next, the average potential outcome in the villages we sample will vary, depending on which of the possible samples we happen to select. The sample average may be characterized as a random variable, a quantity that varies from sample to sample.
The term expected value refers to the average outcome of a random variable. (See Box 2.3.) In our example, the random variable is the number we obtain when we sample villages at random and calculate their average outcome. Recall from introductory statistics that under random sampling, the expected value of a sample average is equal to the average of the population from which the sample is drawn.2 This principle may be illustrated using the population of villages depicted in Table 2.1. Recall that the average value of $Y_i(0)$ among all villages in Table 2.1 is 15. Suppose we sample two villages at random from the list of seven villages and calculate the average value of $Y_i(0)$ for the two selected villages. There are
$$\frac{N!}{v!\,(N-v)!} = \frac{7!}{2!\,5!} = 21 \tag{2.6}$$
possible ways of sampling two villages at random from a list of seven, and each sample is equally likely to be drawn. Any given sample of two villages might contain an average value of $Y_i(0)$ that is higher or lower than the true average of 15, but the expected value refers to what we would obtain on average if we were to examine all 21 possible samples, for each one calculating the average value of $Y_i(0)$:
$$\{10, 12.5, 12.5, 12.5, 12.5, 12.5, 12.5, 15, 15, 15, 15, 15, 15, 15, 17.5, 17.5, 17.5, 17.5, 17.5, 17.5, 20\}. \tag{2.7}$$
2 The easiest way to see the intuition behind this principle is to consider the case in which we randomly sample just one village. Each village is equally likely to be sampled. The average over all seven possible samples is identical to the average for the entire population of seven villages. This logic generalizes to samples where $v > 1$ because each village appears in exactly $v/7$ of all possible samples.
BOX 2.3
The expectation of a discrete random variable $X$ is defined as
$$E[X] = \sum x \Pr[X = x],$$
where $\Pr[X = x]$ denotes the probability that $X$ takes on the value $x$, and where the summation is taken over all possible values of $x$.
For example, what is the expected value of a randomly selected value of $\tau_i$ from Table 2.1?
$$E[\tau_i] = \sum \tau \Pr[\tau_i = \tau] = (-5)\left(\tfrac{1}{7}\right) + (0)\left(\tfrac{2}{7}\right) + (5)\left(\tfrac{1}{7}\right) + (10)\left(\tfrac{2}{7}\right) + (15)\left(\tfrac{1}{7}\right) = 5.$$
Properties of Expectations
The expectation of the constant $\alpha$ is itself: $E[\alpha] = \alpha$.
For a random variable $X$ and constants $\alpha$ and $\beta$, $E[\alpha + \beta X] = \alpha + \beta E[X]$.
The expectation of a sum of two random variables, $X$ and $Y$, is the sum of their expectations: $E[X + Y] = E[X] + E[Y]$.
The expectation of the product of two random variables, $X$ and $Y$, is the product of their expectations plus the covariance between them: $E[XY] = E[X]E[Y] + E[(X - E[X])(Y - E[Y])]$.
The average of these 21 numbers is 15. In other words, the expected value of the average $Y_i(0)$ obtained from a random sample of two villages is 15.
The concept of expectations plays an important role in the discussion that follows. Because we will refer to expectations so often, a bit more notation is helpful. The notation $E[X]$ refers to the expectation of a random variable $X$. (See Box 2.3.) The expression “the expected value of $Y_i(0)$ when one subject is sampled at random” will be written compactly as $E[Y_i(0)]$. When a term like $Y_i(0)$ appears in conjunction with an expectations operator, it should be read not as the value of $Y_i(0)$ for subject $i$ but instead as a random variable that is equal to the value of $Y_i(0)$ for a randomly selected subject. When the expression $E[Y_i(0)]$ is applied to values in Table 2.1, the random variable is the random selection of a $Y_i(0)$ from the list of all $Y_i(0)$; since there are seven possible random selections, the average of which is 15, it follows that $E[Y_i(0)] = 15$.
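The claim that the sample average has an expected value equal to the population average can be verified by brute force for this small example. The sketch below enumerates every possible sample of two villages from Table 2.1; the helper name `sample_means` is an assumption made for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

# Untreated potential outcomes Y_i(0) transcribed from Table 2.1.
Y0 = [10, 15, 20, 20, 10, 15, 15]

def sample_means(values, v):
    """Average of `values` within every possible sample of size v."""
    return [Fraction(sum(s), v) for s in combinations(values, v)]

means = sample_means(Y0, 2)

# Each sample is equally likely, so the expected value of the sample
# average is the plain average over all possible samples.
expected = sum(means) / len(means)

print(len(means), comb(7, 2))  # number of possible samples of size 2
print(comb(7, 3))              # number of possible samples of size 3
print(sorted(float(m) for m in means))
print(expected)
```

The enumeration yields 21 samples whose averages range from 10 to 20, and their mean is exactly 15, the population average of $Y_i(0)$.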
Sometimes attention is focused on the expected value of a random variable within a subgroup. Conditional expectations refer to subgroup averages. In terms of notation, the logical conditions following the | symbol indicate the criteria that define the subgroup. For example, the expression “the expectation of $Y_i(1)$ when one village is selected at random from those villages that were treated” is written $E[Y_i(1) \mid d_i = 1]$. The idea of a conditional expectation is straightforward when working with quantities that are in principle observable. More mind-bending are expressions like $E[Y_i(1) \mid d_i = 0]$, which denotes “the expectation of $Y_i(1)$ when one village is selected at random from those villages that were not treated.” In the course of conducting research, we will never actually see $Y_i(1)$ for an untreated village, nor will we see $Y_i(0)$ for a treated village. These potential outcomes can be imagined but not observed.
One special type of conditional expectation arises when the subgroup is defined by the outcome of a random process. In that case, the conditional expectation may vary depending on which subjects happened to meet the condition in any particular realization of the random process. For example, suppose that a random process, such as a coin flip, determines which subjects are treated. For a given treatment assignment $d_i$, we could calculate $E[Y_i(1) \mid d_i = 0]$, but this expectation might have been different had the coin flips come out differently. Suppose we want to know the expected conditional expectation, or how the conditional expectation would come out, on average, across all possible ways that $d_i$ could have been allocated. Let $D_i$ be a random variable that indicates whether each subject would be treated in a hypothetical experiment. The conditional expectation $E[Y_i(1) \mid D_i = 0]$ is calculated by considering all possible realizations of $D_i$ (all the possible ways that $N$ coins could have been flipped) in order to form the joint probability distribution function for $Y_i(1)$ and $D_i$. As long as we know the joint probability of observing each paired set of values $\{Y_i(1), D_i\}$, we can calculate the conditional expectation using the formula in Box 2.4.3
With this basic system of notation in place, we may now describe the connection between expected potential outcomes and the average treatment effect (ATE):
$$E[Y_i(1) - Y_i(0)] = E[Y_i(1)] - E[Y_i(0)] = \frac{1}{N} \sum_{i=1}^{N} Y_i(1) - \frac{1}{N} \sum_{i=1}^{N} Y_i(0) = \frac{1}{N} \sum_{i=1}^{N} \bigl[Y_i(1) - Y_i(0)\bigr] = \mathrm{ATE}. \tag{2.8}$$
3 The notation $E[Y_i(1) \mid D_i = 0]$ may be regarded as shorthand for $E[E[Y_i(1) \mid d_i = 0, \mathbf{d}]]$, where $\mathbf{d}$ refers to a vector of treatment assignments and $d_i$ refers to its $i$th element. Given $\mathbf{d}$, we may calculate the probability distribution function for all $\{Y_i(1), d_i\}$ pairs and the expectation given this set of assignments. Then we may take the expectation of this expected value by summing over all possible $\mathbf{d}$ vectors.
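The idea of an expected conditional expectation can be illustrated by enumeration. The sketch below makes an assumption that is not in the text: instead of independent coin flips, exactly `m = 3` of the seven villages in Table 2.1 are treated, with every such assignment equally likely. For each assignment it computes the average $Y_i(1)$ among untreated villages, then averages those values across assignments. The function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Treated potential outcomes Y_i(1) transcribed from Table 2.1.
Y1 = [15, 15, 30, 15, 20, 15, 30]

def expected_conditional_mean(Y1, m):
    """Average, over all assignments with m treated units, of the mean
    Y_i(1) among the untreated units (assumes equally likely assignments)."""
    n = len(Y1)
    per_assignment = []
    for treated in combinations(range(n), m):
        control = [i for i in range(n) if i not in treated]
        per_assignment.append(
            Fraction(sum(Y1[i] for i in control), len(control))
        )
    return sum(per_assignment) / len(per_assignment), per_assignment

expected, per_assignment = expected_conditional_mean(Y1, m=3)
print(float(min(per_assignment)), float(max(per_assignment)))
print(expected)
```

Any single assignment can give a conditional average above or below 20, but across all 35 assignments the average is exactly 20, the mean of $Y_i(1)$ for all seven villages.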
BOX 2.4 Definition: Conditional Expectation
For discrete random variables $Y$ and $X$, the conditional expectation of $Y$ given that $X$ takes on the value $x$ is
$$E[Y \mid X = x] = \sum_{y} y \Pr[Y = y \mid X = x] = \sum_{y} y \, \frac{\Pr[Y = y, X = x]}{\Pr[X = x]},$$
where $\Pr[Y = y, X = x]$ denotes the joint probability of $Y = y$ and $X = x$, and where the summation is taken over all possible values of $y$.
For example, in Table 2.1 what is the conditional expectation of a randomly selected value of $\tau_i$ for villages where $Y_i(0) > 10$? This question requires us to describe the joint probability distribution function for the variables $\tau_i$ and $Y_i(0)$ so that we can calculate $\Pr[\tau_i = \tau, Y_i(0) > 10]$. Table 2.1 indicates that the $\{\tau, Y(0)\}$ pair $\{0, 15\}$ occurs with probability 2/7, while the other pairs $\{5, 10\}$, $\{10, 20\}$, $\{-5, 20\}$, $\{10, 10\}$, and $\{15, 15\}$ each occur with probability 1/7. The marginal distribution of $Y_i(0)$ reveals that 5 of the 7 values of $Y_i(0)$ are greater than 10, so $\Pr[Y_i(0) > 10] = 5/7$.
$$E[\tau_i \mid Y_i(0) > 10] = \sum_{\tau} \tau \, \frac{\Pr[\tau_i = \tau, Y_i(0) > 10]}{\Pr[Y_i(0) > 10]} = (-5)\frac{1/7}{5/7} + (0)\frac{2/7}{5/7} + (5)\frac{0/7}{5/7} + (10)\frac{1/7}{5/7} + (15)\frac{1/7}{5/7} = 4.$$
In order to illustrate the idea of a conditional expectation when conditioning on the outcome of a random process, suppose we randomly assign one of the observations in Table 2.1 to treatment ($D_i = 1$) and the remaining six observations to control ($D_i = 0$). If each of the seven possible assignments occurs with probability 1/7, what is the expected value of a randomly selected $\tau_i$ given that $D_i = 1$? Again, we start with the joint probability density function for $\tau_i$ and $D_i$ and consider all possible pairings of these two variables’ values. The $\{\tau, D\}$ pairings $\{-5, 1\}$, $\{5, 1\}$, and $\{15, 1\}$ occur with probability 1/49, while the pairings $\{0, 1\}$ and $\{10, 1\}$ occur with probability 2/49; the remaining $\{\tau, D\}$ pairings are instances in which $\tau$ is paired with 0. The marginal probability is $\Pr[D_i = 1] = 3(1/49) + 2(2/49) = 1/7$.
$$E[\tau_i \mid D_i = 1] = \sum_{\tau} \tau \, \frac{\Pr[\tau_i = \tau, D_i = 1]}{\Pr[D_i = 1]} = (-5)\frac{1/49}{1/7} + (0)\frac{2/49}{1/7} + (5)\frac{1/49}{1/7} + (10)\frac{2/49}{1/7} + (15)\frac{1/49}{1/7} = 5.$$
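The two calculations in Box 2.4 can be checked numerically. The sketch below is illustrative only: it builds each joint distribution from the seven $\{\tau, Y(0)\}$ pairs listed in the box and applies the conditional expectation formula; the helper name `conditional_expectation` is hypothetical.

```python
from fractions import Fraction

# (tau, Y(0)) pairs for the seven villages, as listed in Box 2.4.
pairs = [(0, 15), (0, 15), (5, 10), (10, 20), (-5, 20), (10, 10), (15, 15)]
n = len(pairs)

def conditional_expectation(joint, condition):
    """E[value | condition(other)] from a dict {(value, other): probability}."""
    p_cond = sum(p for (v, o), p in joint.items() if condition(o))
    return sum(v * p for (v, o), p in joint.items() if condition(o)) / p_cond

# Joint distribution of (tau, Y(0)) when one village is drawn at random.
joint_tau_y0 = {}
for tau, y0 in pairs:
    joint_tau_y0[(tau, y0)] = joint_tau_y0.get((tau, y0), 0) + Fraction(1, n)

e_tau_given_y0_gt_10 = conditional_expectation(joint_tau_y0, lambda y0: y0 > 10)

# Joint distribution of (tau, D): one village is treated (each with
# probability 1/7) and, independently, one village is drawn at random.
joint_tau_d = {}
for treated in range(n):
    for drawn in range(n):
        key = (pairs[drawn][0], 1 if drawn == treated else 0)
        joint_tau_d[key] = joint_tau_d.get(key, 0) + Fraction(1, n * n)

e_tau_given_d1 = conditional_expectation(joint_tau_d, lambda d: d == 1)

print(e_tau_given_y0_gt_10)  # 4
print(e_tau_given_d1)        # 5
```

Exact fractions are used so that the results can be compared with the box without rounding error.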
The first line of equation (2.8) expresses the fact that when a village is selected at random from the list of villages, its expected treatment effect is equal to the difference between the expected value of a randomly selected treated potential outcome and the expected value of a randomly selected untreated potential outcome. The second equality in equation (2.8) indicates that the expected value of a randomly selected $Y_i(1)$ equals the average of all $Y_i(1)$ values, and that the expected value of a randomly selected $Y_i(0)$ equals the average of all $Y_i(0)$ values. The third equality reflects the fact that the difference between the two averages in the second line of equation (2.8) can be expressed as the average difference in potential outcomes. The final equality notes that the average difference in potential outcomes is the definition of the average treatment effect. In sum, the difference in expectations equals the difference in average potential outcomes for the entire list of villages, or the ATE.⁴
This relationship is apparent from the schedule of potential outcomes in Table 2.1. The column of numbers representing the treatment effect ($\tau_i$) is, on average, 5. If we were to select villages at random from this list, we would expect their average treatment effect to be 5. We get the same result if we subtract the expected value of a randomly selected $Y_i(0)$ from the expected value of a randomly selected $Y_i(1)$.
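The equalities in equation (2.8) can be verified directly on the village example. The sketch below assumes the schedule of potential outcomes transcribed from Table 2.1, which is not reproduced in this section, so the seven $(Y_i(0), Y_i(1))$ pairs should be read as an assumption.

```python
from fractions import Fraction

# (Y(0), Y(1)) for Villages 1-7, transcribed from Table 2.1 (assumed here,
# since the table itself is not shown in this section).
schedule = [(10, 15), (15, 15), (20, 30), (20, 15), (10, 20), (15, 15), (15, 30)]
n = len(schedule)

mean_y1 = Fraction(sum(y1 for _, y1 in schedule), n)         # average of Y(1)
mean_y0 = Fraction(sum(y0 for y0, _ in schedule), n)         # average of Y(0)
mean_tau = Fraction(sum(y1 - y0 for y0, y1 in schedule), n)  # average of tau

# Difference of averages and average of differences coincide.
print(mean_y1, mean_y0, mean_y1 - mean_y0, mean_tau)
```

Both routes give the same number because averaging is a linear operation, which is the content of the third equality in (2.8).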
2.4 Random Assignment and Unbiased Inference
The challenge of estimating the average treatment effect is that at a given point in time each village is either treated or not: either $Y_i(1)$ or $Y_i(0)$ is observed, but not both. To illustrate the problem, Table 2.2 shows what outcomes would be observed if Village 1 and Village 7 were treated, while the remaining villages were not. We observe $Y_i(1)$ for Villages 1 and 7 but not $Y_i(0)$. For Villages 2, 3, 4, 5, and 6, we observe $Y_i(0)$ but not $Y_i(1)$. The unobserved or “missing” values in Table 2.2 are indicated with a “?”.
4 The notation used here is just one way to explicate the link between expectations and the ATE. Samii and Aronow (2012) suggest an alternative formalization. Their model envisions a finite population $U$ consisting of units $j$ in $1, 2, \ldots, N$, each of which has an associated triple $(y_j(1), y_j(0), D'_j)$ such that $y_j(1)$ and $y_j(0)$ are fixed potential outcomes and $D'_j$ is a random variable indicating the treatment status of unit $j$. Reassign a random index ordering $i$ in $1, 2, \ldots, N$. Then, for an arbitrary unit $i$, there exists an associated triple of random variables $(Y_i(1), Y_i(0), D_i)$ such that the random variable $Y_i = D_i Y_i(1) + (1 - D_i) Y_i(0)$. It follows that for equation (2.8):
$$E[Y_i(1)] - E[Y_i(0)] = \frac{1}{N}\sum_{j=1}^{N} y_j(1) - \frac{1}{N}\sum_{j=1}^{N} y_j(0) = \text{ATE}.$$
Statistical operators such as expectations or independence refer to random variables associated with an arbitrary index $i$. Looking ahead to later chapters, one might expand this system to include other unit-level attributes, such as covariates or missingness, by attaching them to the triple indexed by $j$ before reassigning the ordering.
TABLE 2.2 Illustration of observed outcomes for local budgets when two village councils are headed by women.
Columns: Y_i(0) = budget share if village head is male; Y_i(1) = budget share if village head is female; τ_i = treatment effect.
Village i                                   Y_i(0)    Y_i(1)    τ_i
Village 1                                   ?         15        ?
Village 2                                   15        ?         ?
Village 3                                   20        ?         ?
Village 4                                   20        ?         ?
Village 5                                   10        ?         ?
Village 6                                   15        ?         ?
Village 7                                   ?         30        ?
Estimated average based on observed data    16        22.5      6.5
Note: The observed outcomes in this table are based on the potential outcomes listed in Table 2.1.
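The bottom row of Table 2.2 can be reproduced in a few lines. This is a minimal sketch that assumes the schedule of potential outcomes transcribed from Table 2.1 (not shown in this section); it reveals one potential outcome per village according to the assignment and then compares group averages.

```python
from statistics import mean

# (Y(0), Y(1)) for Villages 1-7, transcribed from Table 2.1 (an assumption
# here, since that table is not reproduced in this section).
schedule = {1: (10, 15), 2: (15, 15), 3: (20, 30), 4: (20, 15),
            5: (10, 20), 6: (15, 15), 7: (15, 30)}
treated = {1, 7}  # the assignment illustrated in Table 2.2

# Only one potential outcome per village is revealed by the assignment.
observed = {v: (y1 if v in treated else y0) for v, (y0, y1) in schedule.items()}

treated_mean = mean(observed[v] for v in treated)
control_mean = mean(observed[v] for v in schedule if v not in treated)
estimate = treated_mean - control_mean

print(treated_mean, control_mean, estimate)
```

The estimate from this single assignment need not equal the ATE of 5; what matters is how the estimator behaves across all possible assignments.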
Random assignment addresses the “missing data” problem by creating two groups of observations that are, in expectation, identical prior to application of the treatment. When treatments are allocated randomly, the treatment group is a random sample of all villages, and therefore the expected potential outcomes among villages in the treatment group are identical to the average potential outcomes among all villages. The same is true for villages in the control group. The control group’s expected potential outcomes are also identical to the average potential outcomes among all villages. Therefore, in expectation, the treatment group’s potential outcomes are the same as the control group’s. Although any given random allocation of villages to treatment and control groups may produce groups of villages that have different average potential outcomes, this procedure is fair in the sense that it does not tend to give one group a higher set of potential outcomes than the other.
As Chattopadhyay and Duflo point out, random assignment is in fact used in rural India to assign women to head one-third of the local village councils.⁵ Ordinarily, men would head the village councils, but Indian law mandates that selected
5 Chattopadhyay and Duflo 2004. A lottery is used to assign council positions to women in Rajasthan. In West Bengal, a near-random assignment procedure is used whereby villagers are assigned according to their serial numbers.
villages install a female representative as head of the council. For purposes of illustration, suppose that our collection of seven villages were subject to this law, and that two villages will be randomly assigned female council heads. Consider the statistical implications of this arrangement. This random assignment procedure implies that every village has the same probability of receiving the treatment; assignment bears no systematic relationship to villages’ observed or unobserved attributes.
Let’s take a closer look at the formal implications of this form of random assignment. When villages are assigned such that every village has the same probability of receiving the treatment, the villages that are randomly chosen for treatment are a random subset of the entire set of villages. Therefore, the expected $Y_i(1)$ potential outcome among treated villages is the same as the expected $Y_i(1)$ potential outcome for the entire set of villages:
$$E[Y_i(1) \mid D_i = 1] = E[Y_i(1)]. \tag{2.9}$$
BOX 2.5 Two Commonly Used Forms of Random Assignment
Random assignment refers to a procedure that allocates treatments with known probabilities that are greater than zero and less than one. The most basic forms of random assignment allocate treatments such that every subject has the same probability of being treated. Let $N$ be the number of subjects, and let $m$ be the expected number of subjects who will be assigned to the treatment group. Assume that $N$ and $m$ are integers such that $0 < m < N$. Simple random assignment refers to a procedure whereby each subject is allocated to the treatment group with probability $m/N$. Complete random assignment refers to a procedure that allocates exactly $m$ units to treatment.
Under simple or complete random assignment, the probability of being assigned to the treatment group is identical for all subjects; therefore treatment status is statistically independent of the subjects’ potential outcomes and their background attributes ($X$): $Y_i(0), Y_i(1), X_i \perp\!\!\!\perp D_i$, where the symbol $\perp\!\!\!\perp$ means “is independent of.” For example, if a die roll is used to assign subjects to treatment with probability 1/6, knowing whether a subject is treated provides no information about the subject’s potential outcomes or background attributes. Therefore, the expected value of $Y_i(0)$, $Y_i(1)$, and $X_i$ is the same in treatment and control groups.
When we randomly select villages into the treatment group, the villages we leave behind for the control group are also a random sample of all villages. The expected $Y_i(1)$ in the control group ($D_i = 0$) is therefore equal to the expected $Y_i(1)$ for the entire set of villages:
$$E[Y_i(1) \mid D_i = 0] = E[Y_i(1)]. \tag{2.10}$$
Putting equations (2.9) and (2.10) together, we see that under random assignment the treatment and control groups have the same expected potential outcome:
$$E[Y_i(1) \mid D_i = 1] = E[Y_i(1) \mid D_i = 0]. \tag{2.11}$$
Equation (2.11) also underscores the distinction between realized and unrealized potential outcomes. On the left side of the equation is the expected treated potential outcome among villages that receive the treatment. The treatment causes this potential outcome to become observable. On the right side of the equation is the expected treated potential outcome among villages that do not receive the treatment. Here, the lack of treatment means that the treated potential outcome remains unobserved for these subjects.
The same logic applies to the control group. Villages that do not receive the treatment ($D_i = 0$) have the same expected untreated potential outcome $Y_i(0)$ that the treatment group ($D_i = 1$) would have if it were untreated:
$$E[Y_i(0) \mid D_i = 0] = E[Y_i(0) \mid D_i = 1] = E[Y_i(0)]. \tag{2.12}$$
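Equations (2.9) through (2.12) can be checked by brute force for the seven-village example. The sketch below enumerates every way of treating two of the seven villages and averages each group’s mean potential outcome over those assignments. It assumes the potential outcomes transcribed from Table 2.1 (not shown in this section), and the function name `expected_group_mean` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# (Y(0), Y(1)) for Villages 1-7, transcribed from Table 2.1 (assumed).
schedule = [(10, 15), (15, 15), (20, 30), (20, 15), (10, 20), (15, 15), (15, 30)]
n, m = len(schedule), 2
assignments = list(combinations(range(n), m))  # 21 equally likely treated sets

def expected_group_mean(which_outcome, in_treatment_group):
    """Average, over all assignments, of one group's mean potential outcome.

    which_outcome is 0 for Y(0) and 1 for Y(1).
    """
    total = Fraction(0)
    for treated in assignments:
        group = [i for i in range(n) if (i in treated) == in_treatment_group]
        total += Fraction(sum(schedule[i][which_outcome] for i in group), len(group))
    return total / len(assignments)

e_y1_treated = expected_group_mean(1, True)    # E[Y(1) | D = 1]
e_y1_control = expected_group_mean(1, False)   # E[Y(1) | D = 0], never observed
e_y0_control = expected_group_mean(0, False)   # E[Y(0) | D = 0]
e_y0_treated = expected_group_mean(0, True)    # E[Y(0) | D = 1], never observed

print(e_y1_treated, e_y1_control, e_y0_control, e_y0_treated)
```

The two quantities flagged as never observed can be computed here only because the full schedule of potential outcomes is stipulated; in an actual experiment they remain hypothetical.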
Equations (2.11) and (2.12) follow from random assignment: $D_i$ conveys no information whatsoever about the potential values of $Y_i(1)$ or $Y_i(0)$. The randomly assigned values of $D_i$ determine which value of $Y_i$ we actually observe, but they are nevertheless statistically independent of the potential outcomes $Y_i(1)$ and $Y_i(0)$. (See Box 2.5 for discussion of the term independence.)
When treatments are assigned randomly, we may rearrange equations (2.8), (2.11), and (2.12) in order to express the average treatment effect as
$$\text{ATE} = E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]. \tag{2.13}$$
This equation suggests an empirical strategy for estimating the average treatment effect. The terms $E[Y_i(1) \mid D_i = 1]$ and $E[Y_i(0) \mid D_i = 0]$ may be estimated using experimental data. We do not observe the $Y_i(1)$ potential outcomes for all observations, but we do observe them for the random sample of observations that receive the treatment. Similarly, we do not observe the $Y_i(0)$ potential outcomes for all observations, but we do observe them for the random sample of observations in the control group. If we want to estimate the average treatment effect, equation (2.13) suggests that we should take the difference between two sample means: the average
outcome in the treatment group minus the average outcome in the control group. Ideas that enable researchers to use observable quantities (e.g., sample averages) to reveal parameters of interest (e.g., average treatment effects) are termed identification strategies.
Statistical procedures used to make guesses about parameters such as the average treatment effect are called estimators. In this example, the estimator is very simple, just a difference between two sample averages. Before applying an estimator to actual data, a researcher should reflect on its statistical properties. One especially important property is unbiasedness. An estimator is unbiased if it generates the right answer, on average. In other words, if the experiment were replicated an infinite number of times under identical conditions, the average estimate would equal the true parameter. Some guesses may be too high and others too low, but the average guess will be correct. In practice, we will not be able to perform an infinite number of experiments. In fact, we might just perform one experiment and leave it at that. Nevertheless, in theory we can analyze the properties of our estimation procedure to see whether, on average, it recovers the right answer. (In the next chapter, we consider another property of estimators: how precisely they estimate the parameter of interest.)
In sum, when treatments are administered using a procedure that gives every subject the same probability of being treated, potential outcomes are independent of the treatments that subjects receive. This property suggests an identification strategy for estimating average treatment effects using experimental data.
The remaining task is to demonstrate that the proposed estimator (the difference between the average outcome in the treatment group and the average outcome in the control group) is an unbiased estimator of the ATE when all subjects have the same probability of being treated. The proof is straightforward. Because the units assigned to the control group are a random sample of all units, the average of the control group outcomes is an unbiased estimator of the average value of $Y_i(0)$
BOX 2.6 Definition: Estimator and Estimate
An estimator is a procedure or formula for generating guesses about parameters such as the average treatment effect. The guess that an estimator generates based on a particular experiment is called an estimate. Estimates are denoted using a “hat” notation. The estimate of the parameter $\theta$ is written $\hat{\theta}$.
among all units. The same goes for the treatment group: the average outcome among units that receive the treatment is an unbiased estimator of the average value of $Y_i(1)$ among all units. Formally, if we randomly shuffle the villages and place the first $m$ subjects in the treatment group and the remaining $N - m$ subjects in the control group, we can analyze the expected, or average, outcome over all possible random assignments:
$$\begin{aligned}
E\Bigg[\underbrace{\frac{\sum_{i=1}^{m} Y_i}{m}}_{\substack{\text{Average outcome} \\ \text{among treated units}}} - \underbrace{\frac{\sum_{i=m+1}^{N} Y_i}{N - m}}_{\substack{\text{Average outcome} \\ \text{among untreated units}}}\Bigg]
&= E\left[\frac{\sum_{i=1}^{m} Y_i}{m}\right] - E\left[\frac{\sum_{i=m+1}^{N} Y_i}{N - m}\right] \\
&= \frac{E[Y_1] + E[Y_2] + \cdots + E[Y_m]}{m} - \frac{E[Y_{m+1}] + E[Y_{m+2}] + \cdots + E[Y_N]}{N - m} \\
&= E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] \\
&= E[Y_i(1)] - E[Y_i(0)] = E[\tau_i] = \text{ATE}.
\end{aligned} \tag{2.14}$$
Equation (2.14) conveys a simple but extremely useful idea. When units are randomly assigned, a comparison of average outcomes in treatment and control groups (the so-called difference-in-means estimator) is an unbiased estimator of the average treatment effect.
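The unbiasedness result in equation (2.14) can be illustrated by enumerating every possible assignment in the seven-village example. The sketch below assumes complete random assignment of two villages to treatment and uses the potential outcomes transcribed from Table 2.1 (not reproduced in this section); the function name `difference_in_means` is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# (Y(0), Y(1)) for Villages 1-7, transcribed from Table 2.1 (assumed).
schedule = [(10, 15), (15, 15), (20, 30), (20, 15), (10, 20), (15, 15), (15, 30)]
n, m = len(schedule), 2
ate = Fraction(sum(y1 - y0 for y0, y1 in schedule), n)

def difference_in_means(treated):
    """Estimate from one assignment: treated units reveal Y(1), the rest Y(0)."""
    control = [i for i in range(n) if i not in treated]
    treated_mean = Fraction(sum(schedule[i][1] for i in treated), len(treated))
    control_mean = Fraction(sum(schedule[i][0] for i in control), len(control))
    return treated_mean - control_mean

# One estimate for each of the 21 equally likely assignments.
estimates = [difference_in_means(t) for t in combinations(range(n), m)]
expected_estimate = sum(estimates) / len(estimates)

print(len(estimates), min(estimates), max(estimates), expected_estimate, ate)
```

Individual estimates differ from one assignment to the next, and the assignment shown in Table 2.2 yields 6.5, but their average across all assignments equals the ATE exactly.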
BOX 2.7 Definition: Unbiased Estimator
An estimator is unbiased if the expected value of the estimates it produces is equal to the true parameter of interest. Call $\theta$ the parameter we seek to estimate, such as the ATE. Let $\hat{\theta}$ represent an estimator, or procedure for generating estimates. For example, $\hat{\theta}$ may represent the difference in average outcomes between treatment and control groups. The expected value of this estimator is the average estimate we would obtain if we apply this estimator to all possible realizations of a given experiment or observational study. We say that $\hat{\theta}$ is unbiased if $E(\hat{\theta}) = \theta$; in words, the estimator $\hat{\theta}$ is unbiased if the expected value of this estimator is $\theta$, the parameter of interest. Although unbiasedness is a property of estimators and not estimates, we refer to the estimates generated by an unbiased estimator as “unbiased estimates.”
2.5 The Mechanics of Random Assignment
The result in equation (2.14) hinges on random assignment, and so it is important to be clear about what constitutes random assignment. Simple random assignment is a term of art, referring to a procedure, such as a die roll or coin toss, that gives each subject an identical probability of being assigned to the treatment group. The practical drawback of simple random assignment is that when $N$ is small, random chance can create a treatment group that is larger or smaller than what the researcher intended. For example, you could flip a coin to assign each of 10 subjects to the treatment condition, but there is only a 24.6% chance of ending up with exactly 5 subjects in treatment and 5 in control. A useful special case of simple random assignment is complete random assignment, where exactly $m$ of $N$ units are assigned to the treatment group with equal probability.⁶
The procedure used to conduct complete random assignment can take any of three equivalent forms. Suppose one has $N$ subjects and seeks to assign treatments to $m$ of them. The first method is to select one subject at random, then select another at random from the remaining units, and so forth until you have selected $m$ subjects into the treatment group. A second method is to enumerate all of the possible ways that $m$ subjects may be selected from a list of $N$ subjects, and randomly select one of the possible allocation schemes. A third method is to randomly permute the order of all $N$ subjects and label the first $m$ subjects as the treatment group.⁷
Beware of the fact that random is a word that is used loosely in common parlance to refer to procedures that are arbitrary, haphazard, or unplanned. The problem is that arbitrary, haphazard, or unplanned treatments may follow systematic patterns that go unnoticed. Procedures such as alternation are risky because there may be systematic reasons why certain types of subjects might alternate in a sequence, and indeed, some early medical experiments ran into exactly this problem.⁸ We use the term random in a more exacting sense. The physical or electronic procedure by which randomization is conducted ensures that assignment to the treatment group is statistically independent of all observed or unobserved variables.
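The contrast between simple and complete random assignment can be made concrete. The sketch below first computes the probability that ten fair coin flips yield exactly five treated subjects, then implements both procedures; the complete version follows the third method described above (a random permutation). The function names and the seed value are illustrative assumptions.

```python
import random
from math import comb

n = 10
# Probability that 10 fair coin flips put exactly 5 subjects in treatment.
p_exactly_five = comb(n, 5) / 2 ** n
print(round(p_exactly_five, 3))  # 0.246

rng = random.Random(2012)  # arbitrary seed, fixed for reproducibility

def simple_assignment(n, p, rng):
    """Each subject is treated independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def complete_assignment(n, m, rng):
    """Exactly m of n subjects are treated: permute, then take the first m."""
    order = list(range(n))
    rng.shuffle(order)
    treated = set(order[:m])
    return [1 if i in treated else 0 for i in range(n)]

# Group size varies under simple assignment but is fixed under complete assignment.
print(sum(simple_assignment(n, 0.5, rng)), sum(complete_assignment(n, 5, rng)))
```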
6 In Chapters 3 and 4, we discuss other frequently used methods of random assignment: clustered random assignment, where groups of subjects are randomly assigned to treatment and control, and block random assignment (also called stratified random assignment), where individuals are first divided into blocks, and then random assignment is performed within each block. Box 2.5 notes that a defining feature of complete (as opposed to clustered or blocked) random assignment is that all possible assignments of $N$ subjects to a treatment group of size $m$ are equally likely.
7 Cox and Reid 2000, p. 20. The term complete randomization is a bit awkward, as the word complete does not convey the requirement that exactly $m$ units are allocated to treatment, but this terminology has become standard (see Rosenbaum 2002, pp. 25-26).
8 Hróbjartsson, Gøtzsche, and Gluud 1998.
In practical terms, random assignment is best done using statistical software. Here is an easy procedure for implementing complete random assignment. First, determine $N$, the number of subjects in your experiment, and $m$, the number of subjects who will be allocated to the treatment group. Second, set a random number “seed” using a statistics package, so that your random numbers may be reproduced by anyone who cares to replicate your work. Third, generate a random number for each subject. Fourth, sort the subjects by their random numbers in ascending order. Finally, classify the first $m$ observations as the treatment group. Example programs using R may be found at http://isps.research.yale.edu/FEDAI.
Generating random numbers is just the first step in implementing random assignment. After the numbers are generated, one must take pains to preserve the integrity of the assignment process. A deficiency of alternation and many other arbitrary procedures is that they allow those administering the allocation to foresee who will be assigned to which experimental group. If a receptionist seeks to get the sickest patients into the experimental treatment group and knows that the pattern of assignments alternates, he can reorder the patients in such a way as to shuttle the sickest subjects into the treatment group.⁹ The same concern arises even when a random sequence of numbers is used to assign incoming patients: random allocation may be undone if the receptionist knows the order of assignments ahead of time, because that enables him to position patients so that they will be assigned to a certain experimental group. In order to guard against potential threats to the integrity of random assignment, researchers should build extra procedural safeguards into their experimental designs, such as blinding those administering the experiment to the subjects’ assigned experimental groups.
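The five-step procedure for complete random assignment described earlier in this section translates almost line for line into code. The sketch below is in Python rather than R and is not the authors’ program; the function name, the village labels, and the seed value are hypothetical choices made for illustration.

```python
import random

def complete_random_assignment(subject_ids, m, seed):
    """Assign exactly m subjects to treatment, reproducibly.

    Step 1 (choosing N and m) is done by the caller.
    """
    rng = random.Random(seed)                              # step 2: set the seed
    draws = {s: rng.random() for s in subject_ids}         # step 3: one draw per subject
    ordered = sorted(subject_ids, key=lambda s: draws[s])  # step 4: sort ascending
    treated = set(ordered[:m])                             # step 5: first m are treated
    return {s: int(s in treated) for s in subject_ids}

villages = [f"Village {k}" for k in range(1, 8)]
assignment = complete_random_assignment(villages, m=2, seed=1234)
print(assignment)
```

Because the seed is recorded, anyone rerunning the function with the same inputs obtains the same assignment, which is the point of the second step.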
2.6 The Threat of Selection Bias When Random Assignment Is Not Used
Without random assignment, the identification strategy derived from equation (2.14) unravels. The treatment and control groups are no longer random subsets of all units in the sample. Instead, we confront what is known as a selection problem: receiving treatment may be systematically related to potential outcomes. For example, absent random assignment, villages determine whether their councils are headed by women. The villages that end up with female council heads may not be a random subset of all villages.
9 For examples of experiments in which random assignment was subverted, see Torgerson and Torgerson 2008.
To see how nonrandom selection jeopardizes the identification strategy of comparing average outcomes in the treatment and control groups, rewrite the expected difference in outcomes from equation (2.13) by subtracting and adding $E[Y_i(0) \mid D_i = 1]$:
$$\underbrace{E[Y_i(1) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\substack{\text{Expected difference between} \\ \text{treated and untreated outcomes}}} = \underbrace{E[Y_i(1) - Y_i(0) \mid D_i = 1]}_{\text{ATE among the treated}} + \underbrace{E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]}_{\text{Selection bias}}. \tag{2.15}$$
Under random assignment, the selection bias term is zero, and the ATE among the (randomly) treated villages is the same as the ATE among all villages. In the absence of random assignment, equation (2.15) warns that the apparent treatment effect is a mixture of selection bias and the ATE for a subset of villages.
In order to appreciate the implications of equation (2.15), consider the following scenario. Suppose that instead of randomly selecting villages to receive the treatment, our procedure were to let villages decide whether to take the treatment. Refer back to Table 2.1 and imagine that, if left to their own devices, Village 5 and Village 7 always elect a woman due to villagers’ pent-up demand for water sanitation, while the remaining villages always elect a man.¹⁰ Self-selection in this case leads to an exaggerated estimate of the ATE because receiving the treatment is associated with lower-than-average values of $Y_i(0)$ and higher-than-average values of $Y_i(1)$. The average outcome in the treatment group is 25, and the average outcome in the control group is 16. The estimated ATE is therefore 9, whereas the actual ATE is 5. Referring to equation (2.15), we see that in this case the ATE among the treated is not equal to the ATE for the entire subject pool, nor is the selection bias term equal to zero. The broader point is that it is risky to compare villages that choose to receive the treatment with villages that choose not to. In this example, self-selection is related to potential outcomes; as a result, the comparison of treated and untreated villages recovers neither the ATE for the sample as a whole nor the ATE among those villages that receive treatment.
The beauty of experimentation is that the randomization procedure generates a schedule of treatment and control assignments that are statistically independent of
10 When taking expectations over hypothetical replications of an experiment, we consider all possible random assignments. In our example of non-random allocation, however, nature makes the assignment. When taking expectations, we must therefore consider the average of all possible natural assignments. Rather than make up an assortment of possible assignments and stipulate the probability that each scenario occurs, we have kept the example as simple as possible and assumed that the villages “always” elect the same type of candidate. In effect, we are taking expectations over just one possible assignment that occurs with probability 1.
potential outcomes. In other words, the assumptions underlying equations (2.9) to (2.13) are justified by reference to the procedure of random assignment, not substantive arguments about the comparability of potential outcomes in the treatment and control groups.
The preceding discussion should not be taken to imply that experimentation invokes no substantive assumptions. The unbiasedness of the difference-in-means estimator hinges not only on random assignment but also on two assumptions about potential outcomes, the plausibility of which will vary depending on the application. The next section spells out these important assumptions.
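The self-selection scenario in this section, in which Villages 5 and 7 always elect a woman, can be decomposed term by term following equation (2.15). The sketch below assumes the potential outcomes transcribed from Table 2.1 (not reproduced in this section); the variable names are illustrative.

```python
from fractions import Fraction

# (Y(0), Y(1)) for Villages 1-7, transcribed from Table 2.1 (assumed).
schedule = {1: (10, 15), 2: (15, 15), 3: (20, 30), 4: (20, 15),
            5: (10, 20), 6: (15, 15), 7: (15, 30)}
treated = {5, 7}                  # villages that self-select into treatment
control = set(schedule) - treated

def mean(values):
    values = list(values)
    return Fraction(sum(values), len(values))

# Left side of (2.15): the apparent treatment effect.
observed_difference = (mean(schedule[v][1] for v in treated)
                       - mean(schedule[v][0] for v in control))
# Right side of (2.15): ATE among the treated plus selection bias.
att = mean(schedule[v][1] - schedule[v][0] for v in treated)
selection_bias = (mean(schedule[v][0] for v in treated)
                  - mean(schedule[v][0] for v in control))
ate = mean(y1 - y0 for y0, y1 in schedule.values())

print(observed_difference, att, selection_bias, ate)  # 9 25/2 -7/2 5
```

The apparent effect of 9 is the sum of an ATE among the treated of 12.5 and a selection bias of -3.5, and neither component equals the ATE of 5 for all seven villages.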
2.7 Two Core Assumptions about Potential Outcomes
To this point, our characterization of potential outcomes has glossed over two important details. In order to ease readers into the framework of potential outcomes, we simply stipulated that each subject has two potential outcomes, $Y_i(1)$ if treated and $Y_i(0)$ if not treated. To be more precise, each potential outcome depends solely on whether the subject itself receives the treatment. When writing potential outcomes in this way, we are assuming that potential outcomes respond only to the treatment and not some other feature of the experiment, such as the way the experimenter assigns treatments or measures outcomes. Furthermore, potential outcomes are defined over the set of treatments that the subject itself receives, not the treatments assigned to other subjects. In technical parlance, the “solely” assumption is termed excludability and the “itself” assumption is termed non-interference.
2.7.1 Excludability
When we define two, and only two, potential outcomes based on whether the treatment is administered, we implicitly assume that the only relevant causal agent is receipt of the treatment. Because the point of an experiment is to isolate the causal effect of the treatment, our schedule of potential outcomes excludes from consideration factors other than the treatment. When conducting an experiment, therefore, we must define the treatment and distinguish it from other factors with which it may be correlated. Specifically, we must distinguish between $d_i$, the treatment, and $z_i$, a variable that indicates which observations have been allocated to treatment or control. We seek to estimate the effect of $d_i$, and we assume that the treatment assignment $z_i$ has no effect on outcomes except insofar as it affects the value of $d_i$.
The term exclusion restriction or excludability refers to the assumption that $z_i$ can be omitted from the schedule of potential outcomes for $Y_i(1)$ and $Y_i(0)$. Formally, this
assumption may be written as follows. Let $Y_i(z, d)$ be the potential outcome when $z_i = z$ and $d_i = d$, for $z \in \{0, 1\}$ and for $d \in \{0, 1\}$. For example, if $z_i = 1$ and $d_i = 1$, the subject is assigned to the treatment group and receives the treatment. We can also envision other combinations. For example, if $z_i = 1$ and $d_i = 0$, the subject is assigned to the treatment group but for some reason does not receive the treatment. The exclusion restriction assumption is that $Y_i(1, d) = Y_i(0, d)$. In other words, potential outcomes respond only to the input from $d_i$; the value of $z_i$ is irrelevant. Unfortunately, this assumption cannot be verified empirically because we never observe both $Y_i(1, d)$ and $Y_i(0, d)$ for the same subject.
The exclusion restriction breaks down when random assignment sets in motion causes of $Y_i$ other than the treatment $d_i$. Suppose the treatment in our running example were defined as whether or not a woman council head presides over deliberations about village priorities. Our ability to estimate the effect of this treatment would be jeopardized if nongovernmental aid organizations, sensing that newly elected women will prioritize clean water, were to redirect their efforts to promote water sanitation to male-led villages. If outside aid flows to male-led villages, obviating the need for male village council leaders to allocate their budgets to water sanitation, the apparent difference between water sanitation budgets in councils led by women and councils led by men will exaggerate the true effect of the treatment, as defined above.¹¹ Even if it were the case that women council leaders have no effect on their own villages’ budgets, the behavior of the NGOs could generate different average budgets in male- and female-led villages.
Asymmetries in measurement represent another threat to the excludability assumption. Suppose, for example, that in our study of Indian villages, we were to dispatch one group of research assistants to measure budgets in the treatment group and a different group of assistants to measure budgets in the control group. Each group of assistants may apply a different standard when determining what expenditures are to be classified as contributing to water sanitation. Suppose the research assistants in the treatment group were to use a more generous accounting standard, so that they tend to exaggerate the amount of money that the village allocates to water sanitation. When we compare average budgets in the treatment and control groups, the estimated treatment effect will be a combination of the true effect of female village heads on budgets and accounting procedures that exaggerate the amount of money spent on water sanitation in those villages. Presumably, when we envisioned the experiment and what we might learn from it, we sought to estimate only the first of these two effects. We wanted to know the effect of female leadership on budgets using a consistent standard of accounting.
11 Whether an excludability violation occurs depends on how a treatment effect is defined. If one were to define the effect of electing a woman to include the compensatory behavior of NGOs, this assumption would no longer be violated.
CAUSAL INFERENCE AND EXPERIMENTATION
To illustrate the consequences of measurement asymmetry, we may write out a simple model in which outcomes are measured with error. Under this scenario, the usual schedule of potential outcomes expands to reflect the fact that outcomes are influenced not only by $d_i$, but also by $z_i$, which determines which set of research assistants measure the outcome. Suppose that among untreated units we observe $Y_i(0)^* = Y_i(0) + e_{i0}$, where $e_{i0}$ is the error that is made when measuring the potential outcome if an observation is assigned to the control group. For treated units, let $Y_i(1)^* = Y_i(1) + e_{i1}$. What happens if we compare average outcomes among treated and untreated units? The expected value of the difference-in-means estimator from equation (2.14) is
$$
E\left[\frac{1}{m}\sum_{i=1}^{m} Y_i(1)^* - \frac{1}{N-m}\sum_{i=m+1}^{N} Y_i(0)^*\right] = E[Y_i(1)^* \mid D_i = 1] - E[Y_i(0)^* \mid D_i = 0]
$$

$$
= E[Y_i(1) \mid D_i = 1] + E[e_{i1} \mid D_i = 1] - E[Y_i(0) \mid D_i = 0] - E[e_{i0} \mid D_i = 0]. \tag{2.16}
$$

Comparing equation (2.16) to equation (2.14) reveals that the difference-in-means estimator is biased when the measurement errors in the treated and untreated groups have different expected values:

$$
E[e_{i1} \mid D_i = 1] \neq E[e_{i0} \mid D_i = 0]. \tag{2.17}
$$
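The bias described by equations (2.16) and (2.17) can be checked by brute force. The following is a minimal sketch, assuming four hypothetical units whose potential outcomes and measurement errors are invented for illustration (none of these numbers come from the text). It averages the difference-in-means estimator over every complete random assignment of two units to treatment and compares the result with the true ATE.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical true potential outcomes for N = 4 units (illustrative values only).
y0 = [10, 15, 20, 15]   # Y_i(0)
y1 = [15, 15, 30, 20]   # Y_i(1)

# Hypothetical measurement errors: assessors in the treatment group overstate by 2.
e0 = [0, 0, 0, 0]       # e_i0, error when the unit is measured as a control
e1 = [2, 2, 2, 2]       # e_i1, error when the unit is measured as treated

N, m = len(y0), 2

def mean(values):
    return Fraction(sum(values), len(values))

def diff_in_means(treated):
    """Difference in measured group means for one assignment (set of treated indices)."""
    t = [y1[i] + e1[i] for i in range(N) if i in treated]
    c = [y0[i] + e0[i] for i in range(N) if i not in treated]
    return mean(t) - mean(c)

# Average the estimator over every complete random assignment of m units.
assignments = [set(c) for c in combinations(range(N), m)]
expected_estimate = sum(diff_in_means(a) for a in assignments) / len(assignments)

true_ate = mean(y1) - mean(y0)
bias = expected_estimate - true_ate
print(expected_estimate, true_ate, bias)   # prints: 7 5 2
```

In this sketch the bias equals the gap between the average errors in the two measurement regimes, which is the condition flagged in equation (2.17). Setting `e1` equal to `e0` makes the bias vanish.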
In this book, when we speak of a “breakdown in symmetry,” we have in mind procedures that may distort the expected difference between treatment and control outcomes.

What kinds of experimental procedures bolster the plausibility of the excludability assumption? The broad answer is anything that helps ensure uniform handling of treatment and control groups. One type of procedure is double-blindness: neither the subjects nor the researchers charged with measuring outcomes are aware of which treatments the subjects receive, so that they cannot consciously or unconsciously distort the results. Another procedure is parallelism in the administration of an experiment: the same questionnaires and survey interviewers should be used to assess outcomes in both treatment and control groups, and both groups’ outcomes should be gathered at approximately the same time and under similar conditions. If outcomes for the control group are gathered in October, but outcomes in the treatment group are gathered in November, symmetry may be jeopardized.

The exclusion restriction cannot be evaluated unless the researcher has stated precisely what sort of treatment effect the experiment is intended to measure and designed the experiment accordingly. Depending on the researcher’s objective, the control group may receive a special type of treatment so that the treatment vs. control comparison isolates a particular aspect of the treatment. A classic example of a research design that attempts to isolate a specific cause is a pharmaceutical trial in
which an experimental pill is administered to the treatment group while an identical sugar pill is administered to the control group. The aim of administering a pill to both groups is to isolate the pharmacological effects of the ingredients, holding constant the effect of merely taking some sort of pill. In the village council example, a researcher may wish to distinguish the effects of female leadership of local councils from the effects of merely appointing non-incumbents to the headship. In principle, one could compare districts with randomly assigned women heads to districts with randomly assigned term limits, a policy that has the effect of bringing non-incumbents into leadership roles. This approach to isolating causal mechanisms is revisited in Chapter 10, where we discuss designs that attempt to differentiate the active ingredients in a multifaceted treatment.

Protecting the theoretical integrity of the treatment vs. control comparison is of paramount importance in experimental design. In the case of the village budget study, the aim is to estimate the budgetary consequences of having a randomly allocated female village head, not the consequences of using a different measurement standard to evaluate outcomes in treatment and control villages. The same argument goes for other aspects of research activity that might be correlated with treatment assignment. For example, if the aim is to measure the effect of female leadership on budgets per se, bias may be introduced if one sends a delegation of researchers to monitor village council deliberations in women-headed villages only. Now the observed treatment effect is a combination of the effect of female leadership and the effect of research observers.
Whether one regards the presence of the research delegation as a distortion of measurement or an unintended pathway by which assignment to treatment affects the outcome, the formal structure of the problem remains the same. The expected outcome of the experiment no longer reveals the causal effect we set out to estimate.

The symmetry requirement does not rule out cross-cutting treatments. For example, one could imagine a version of India’s reservation policy that randomly assigned some village council seats to women, others to people from lower castes, and still others to women from lower castes. When we discuss factorial designs in Chapter 9, we will stress what can be learned from deploying several treatments in combination with one another. The point of these more complex designs is to learn about combinations of treatments while still preserving symmetry: randomly assigning treatments both alone and in combination with one another allows the researcher to distinguish empirically between having a female village head and having a female village head who is also from a lower caste.

Finally, let’s revisit the case in which other actors intervene in response to your treatment assignments. For example, suppose that in anticipation of greater spending on water sanitation, interest groups devote special attention to lobbying village councils headed by women. Or it may go the other way: interest groups focus greater efforts on villages headed by men because they believe that’s where they will meet the most resistance from budget makers. Whether interest group interference violates
the assumption of excludability depends on how we define the treatment effect. Interest group activity presents no threat to the exclusion restriction if we define the effect of installing a female council head to include all of the indirect repercussions that it could have on interest group activity. If, however, we seek to estimate the specific effect of having female council heads without any interference by interest groups, our experimental design may be inadequate unless we can find a way to prevent interest groups from responding strategically. These kinds of scenarios again underscore the importance of clearly stating the experimental objectives so that researchers and readers can assess the plausibility of the exclusion restriction.
2.7.2 Non-Interference
For ease of presentation, the above discussion only briefly mentioned an assumption that plays an important role in the definition and estimation of causal effects. This assumption is sometimes dubbed the Stable Unit Treatment Value Assumption, or SUTVA, but we refer to it by a more accessible name, non-interference.12 In the notation used above, expressions such as $Y_i(d)$ are written as though the value of the potential outcome for unit $i$ depends only upon whether or not the unit itself gets the treatment (whether $d$ equals one or zero). A more complete notation would express a more extensive schedule of potential outcomes depending on which treatments are administered to other units. For example, for Village 1 we could write down all of the potential outcomes if only Village 1 is treated, if only Village 2 is treated, if Villages 1 and 2 are treated, and so forth. This schedule of potential outcomes quickly gets out of hand. Suppose we listed all of the potential outcomes if exactly two of the seven villages are treated: there would now be 21 potential outcomes for each village. Clearly, if our study involves just seven villages, we have no hope of saying anything meaningful about this complex array of causal effects unless we make some simplifying assumptions.

The non-interference assumption cuts through this complexity by ignoring the potential outcomes that would arise if subject $i$ were affected by the treatment of other subjects. Formally, we reduce the schedule of potential outcomes $Y_i(\mathbf{d})$, where $\mathbf{d}$ describes all of the treatments administered to all subjects, to a much simpler schedule $Y_i(d)$, where $d$ refers to the treatment administered to subject $i$.13 In the context of our example, non-interference implies that the sanitation budget in one village is unaffected by the gender of the council heads in other villages. Non-interference is an assumption common to both experimental and observational studies.
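The count of 21 quoted above is simply the number of ways to choose two villages out of seven. A short sketch, using only the Python standard library, makes the bookkeeping explicit and shows how much the non-interference assumption shrinks the schedule of potential outcomes. The variable names are illustrative, and the figure of 128 assumes that any subset of the seven villages could be treated.

```python
from itertools import combinations
from math import comb

n_villages = 7

# Every way to treat exactly two of the seven villages.
two_treated = list(combinations(range(1, n_villages + 1), 2))
print(len(two_treated))             # 21, the same as comb(7, 2)

# Without non-interference, a village could have a distinct potential outcome
# for every full treatment vector (each village either treated or untreated).
all_vectors = 2 ** n_villages       # 128

# Under non-interference, only the village's own treatment status matters.
own_status_only = 2

print(all_vectors, own_status_only)
```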
12 The term “stable” in SUTVA refers to the stipulation that the potential outcomes for a given village remain stable regardless of which other villages happen to be treated. The technical aspects of this term are discussed in Rubin 1980 and Rubin 1986.
13 Implicit in this formulation of potential outcomes is the assumption that potential outcomes are unaffected by the overall pattern of actual or assigned treatments. In other words, $Y_i(\mathbf{z}, \mathbf{d}) = Y_i(z, d)$.
Is non-interference realistic in this example? It is difficult to say without more detailed information about communication between villages and the degree to which their budget allocations are interdependent. If the collection of villages were dispersed geographically, it might be plausible to assume that the gender of the village head in one village has no consequences for outcomes in other villages. On the other hand, if villages were adjacent, the presence of a woman council head in one village might encourage women in other villages to express their policy demands more forcefully. Proximal villages might also have interdependent budgets; the more one village spends on water sanitation, the less the neighboring village needs to spend in order to maintain its own water quality.

The estimation problems that interference introduces are potentially quite complicated and unpredictable. Untreated villages that are affected by the treatments that nearby villages receive no longer constitute an untreated control group. If women council heads set an example of water sanitation spending that is then copied by neighboring villages headed by men, a comparison between average outcomes in treatment villages and (semi-treated) control villages will tend to understate the average treatment effect as defined in equation (2.3), which is usually understood to refer to the contrast between treated potential outcomes and completely untreated potential outcomes. On the other hand, if female council heads cause neighboring villages headed by men to free ride on water sanitation projects and allocate less of their budget to it, the apparent difference in average budget allocations will exaggerate the average treatment effect.
Given the vagaries of estimation in the face of interference, researchers often try to design experiments in ways that minimize interference between units by spreading them out temporally or geographically. Another approach, discussed at length in Chapter 8, is to design experiments in ways that allow the researcher to detect spillover between units. Instead of treating interference as a nuisance, these more complex experimental designs aim to detect evidence of communication or strategic interaction among units.
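The direction of the bias under interference can be illustrated with a deliberately simple sketch. It assumes seven hypothetical villages with invented budget shares and a constant treatment effect, and it models spillover crudely: every control village's outcome shifts by the same amount whenever other villages are treated. Real spillovers would depend on proximity and other factors, so this is an illustration of the logic rather than a model of the study.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical budget shares (illustrative values, not taken from the text).
y_untreated = [10, 15, 20, 20, 10, 15, 15]   # outcome if no village is treated
y_treated = [15, 20, 25, 25, 15, 20, 20]     # outcome if the village itself is treated

N, m = 7, 2

def expected_diff_in_means(spillover):
    """Average difference in means over all assignments of m villages, when each
    control village's outcome shifts by `spillover` because others are treated."""
    assignments = list(combinations(range(N), m))
    total = Fraction(0)
    for treated in assignments:
        t = [y_treated[i] for i in treated]
        c = [y_untreated[i] + spillover for i in range(N) if i not in treated]
        total += Fraction(sum(t), len(t)) - Fraction(sum(c), len(c))
    return total / len(assignments)

# ATE defined against completely untreated potential outcomes.
ate = Fraction(sum(y_treated) - sum(y_untreated), N)

print(ate)                          # 5
print(expected_diff_in_means(0))    # 5: no interference, estimator is unbiased
print(expected_diff_in_means(2))    # 3: imitation by control villages understates the ATE
print(expected_diff_in_means(-2))   # 7: free riding by control villages exaggerates the ATE
```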
SUMMARY

This chapter has limited its purview to a class of randomized experiments in which treatments are deployed exactly as assigned and outcomes are observed for all of the assigned subjects. This class of studies is a natural starting point for discussing core assumptions and what they imply for research design. The chapters that follow will introduce further assumptions in order to handle the complications that arise due to noncompliance (Chapters 5 and 6) and attrition (Chapter 7).

We began by defining a causal effect as the difference between two potential outcomes, one in which a subject receives treatment and the other in which the subject does not receive treatment. The causal effect for any given subject is not directly observable. However, experiments provide unbiased estimates of the average treatment effect (ATE) among all subjects when certain assumptions are met. The three assumptions invoked in this chapter are random assignment, excludability, and non-interference.

1. Random assignment: Treatments are allocated such that all units have a known probability between 0 and 1 of being placed into the treatment group. Simple random assignment or complete random assignment implies that treatment assignments are statistically independent of the subjects’ potential outcomes. This assumption is satisfied when all treatment assignments are determined by the same random procedure, such as the flip of a coin. Because random assignment may be compromised by those allocating treatments or assisting subjects, steps should be taken to minimize the role of discretion.

2. Excludability: Potential outcomes respond solely to receipt of the treatment, not to the random assignment of the treatment or any indirect by-products of random assignment. The treatment must be defined clearly so that one can assess whether subjects are exposed to the intended treatment or something else. This assumption is jeopardized when (i) different procedures are used to measure outcomes in the treatment and control groups and (ii) research activities, other treatments, or third-party interventions other than the treatment of interest differentially affect the treatment and control groups.

3. Non-interference: Potential outcomes for observation $i$ reflect only the treatment or control status of observation $i$ and not the treatment or control status of other observations. No matter which subjects the random assignment allocates to treatment or control, a given subject’s potential outcomes remain the same. This assumption is jeopardized when (i) subjects are aware of the treatments that other subjects receive, (ii) treatments may be transmitted from treated to untreated subjects, or (iii) resources used to treat one set of subjects diminish resources that would otherwise be available to other subjects. See Chapter 10 for a more extensive list of examples.

Random assignment is different from the other two assumptions in that it refers to a procedure and the manner in which researchers carry it out. Excludability and non-interference, on the other hand, are substantive assumptions about the ways in which subjects respond to the allocation of treatments. When assessing excludability and non-interference in the context of a particular experiment, the first step is to carefully consider how the causal effect is defined. Do we seek to study the effect of electing women to village council positions or rather the effect of electing women from a pool of candidates that consists only of women? When defining the treatment effect of installing a female village council head, is the appropriate comparison a village with male leadership, or a male-led village with no neighboring female-led villages? Attending to these subtleties encourages a researcher to design more exacting experimental comparisons and to interpret the results with greater precision.
Attentiveness to these core assumptions also helps guide experimental investigation, urging researchers to explore the empirical consequences of different research designs. A series of experiments in a particular domain may be required before a researcher can gauge whether subjects seem to be affected by the random assignment over and above the treatment (a violation of excludability) or by the treatments administered to other units (interference).

SUGGESTED READINGS

Holland (1986) and Rubin (2008) provide non-technical introductions to potential outcomes notation. Fisher (1935) and Cox (1958) are two classic books on experimental design and analysis; Dean and Voss (1999) and Kuehl (1999) offer more modern treatments. See Rosenbaum and Rubin (1984) on the distinctive statistical properties of randomly assigned treatments.
EXERCISES: CHAPTER 2

1. Potential outcomes notation:
(a) Explain the notation “$Y_i(0)$.”
(b) Explain the notation “$Y_i(0) \mid D_i = 1$” and contrast it with the notation “$Y_i(0) \mid d_i = 1$.”
(c) Contrast the meaning of “$Y_i(0)$” with the meaning of “$Y_i(0) \mid D_i = 0$.”
(d) Contrast the meaning of “$Y_i(0) \mid D_i = 1$” with the meaning of “$Y_i(0) \mid D_i = 0$.”
(e) Contrast the meaning of “$E[Y_i(0)]$” with the meaning of “$E[Y_i(0) \mid D_i = 1]$.”
(f) Explain why the “selection bias” term in equation (2.15), $E[Y_i(0) \mid D_i = 1] - E[Y_i(0) \mid D_i = 0]$, is zero when $D_i$ is randomly assigned.

2. Use the values depicted in Table 2.1 to illustrate that $E[Y_i(0)] - E[Y_i(1)] = E[Y_i(0) - Y_i(1)]$.

3. Use the values depicted in Table 2.1 to complete the table below.
(a) Fill in the number of observations in each of the nine cells.
(b) Indicate the percentage of all subjects that fall into each of the nine cells. (These cells represent what is known as the joint frequency distribution of $Y_i(0)$ and $Y_i(1)$.)
(c) At the bottom of the table, indicate the proportion of subjects falling into each category of $Y_i(1)$. (These cells represent what is known as the marginal distribution of $Y_i(1)$.)
(d) At the right of the table, indicate the proportion of subjects falling into each category of $Y_i(0)$ (i.e., the marginal distribution of $Y_i(0)$).
(e) Use the table to calculate the conditional expectation $E[Y_i(0) \mid Y_i(1) > 15]$. (Hint: This expression refers to the expected value of $Y_i(0)$ given that $Y_i(1)$ is greater than 15.)
(f) Use the table to calculate the conditional expectation $E[Y_i(1) \mid Y_i(0) > 15]$.

| $Y_i(0)$ | $Y_i(1)$ = 15 | $Y_i(1)$ = 20 | $Y_i(1)$ = 30 | Marginal distribution of $Y_i(0)$ |
| 10 | | | | |
| 15 | | | | |
| 20 | | | | |
| Marginal distribution of $Y_i(1)$ | | | | 1.0 |
4. Suppose that the treatment indicator $d_i$ is either 1 (treated) or 0 (untreated). Define the average treatment effect among the treated, or ATT for short, as $\frac{\sum_{i=1}^{N} \tau_i d_i}{\sum_{i=1}^{N} d_i}$. Using the equations in this chapter, prove the following claim: “When treatments are allocated using complete random assignment, the ATT is, in expectation, equal to the ATE.” In other words, taking expectations over all possible random assignments, $E[\tau_i \mid D_i = 1] = E[\tau_i]$, where $\tau_i$ is a randomly selected observation’s treatment effect.

5. A researcher plans to ask six subjects to donate time to an adult literacy program. Each subject will be asked to donate either 30 or 60 minutes. The researcher is considering three methods for randomizing the treatment. One method is to flip a coin before talking to each person and to ask for a 30-minute donation if the coin comes up heads or a 60-minute donation if it comes up tails. The second method is to write “30” and “60” on three playing cards each, and then shuffle the six cards. The first subject would be assigned the number on the first card, the second subject would be assigned the number on the second card, and so on. A third method is to write each number on three different slips of paper, seal the six slips into envelopes, and shuffle the six envelopes before talking to the first subject. The first subject would be assigned the first envelope, the second subject would be assigned the second envelope, and so on.
(a) Discuss the strengths and weaknesses of each approach.
(b) In what ways would your answer to (a) change if the number of subjects were 600 instead of 6?
(c) What is the expected value of $D_i$ (the assigned number of minutes) if the coin toss method is used? What is the expected value of $D_i$ if the sealed envelope method is used?
6. Many programs strive to help students prepare for college entrance exams, such as the SAT. In an effort to study the effectiveness of these preparatory programs, a researcher draws a random sample of students attending public high school in the United States, and compares the SAT scores of those who took a preparatory class to those who did not. Is this an experiment or an observational study? Why?

7. Suppose that an experiment were performed on the villages in Table 2.1, such that two villages are allocated to the treatment group and the other five villages to the control group. Suppose that an experimenter randomly selects Villages 3 and 7 from the set of seven villages and places them into the treatment group. Table 2.1 shows that these villages have unusually high potential outcomes.
(a) Define the term unbiased estimator.
(b) Does this allocation procedure produce upwardly biased estimates? Why or why not?
(c) Suppose that instead of using random assignment, the researcher placed Villages 3 and 7 into the treatment group because the treatment could be administered inexpensively in those villages. Explain why this procedure is prone to bias.

8. Peisakhin and Pinto14 report the results of an experiment in India designed to test the effectiveness of a policy called the Right to Information Act (RTIA), which allows citizens to inquire about the status of a pending request from government officials. In their study, the researchers hired confederates, slum dwellers who sought to obtain ration cards (which permit the purchase of food at low cost). Applicants for such cards must fill out a form and have their residence and income verified by a government agent. Slum dwellers widely believe that the only way to obtain a ration card is to pay a bribe. The researchers instructed the confederates to apply for ration cards in one of four ways, specified by the researchers. The control group submitted an application form at a government office; the RTIA group submitted a form and followed it up with an official Right to Information request; the NGO group submitted a letter of support from a local nongovernmental organization (NGO) along with the application form; and finally, a bribe group submitted an application and paid a small fee to a person who is known to facilitate the processing of forms.

| | Bribe | RTIA | NGO | Control |
| Number of confederates in the study | 24 | 23 | 18 | 21 |
| Number of confederates who had residence verification | 24 | 23 | 18 | 20 |
| Median number of days to residence verification | 17 | 37 | 37 | 37 |
| Number of confederates who received a ration card within one year | 24 | 20 | 3 | 5 |

14 Peisakhin and Pinto 2010.
(a) Interpret the apparent effects of the treatments on the proportion of applicants who have their residence verified and the speed with which verification occurred.
(b) Interpret the apparent effects of the treatments on the proportion of applicants who actually received a ration card.
(c) What do these results seem to suggest about the effectiveness of the Right to Information Act as a way of helping slum dwellers obtain ration cards?

9. A researcher wants to know how winning large sums of money in a national lottery affects people’s views about the estate tax. The researcher interviews a random sample of adults and compares the attitudes of those who report winning more than $10,000 in the lottery to those who claim to have won little or nothing. The researcher reasons that the lottery chooses winners at random, and therefore the amount that people report having won is random.
(a) Critically evaluate this assumption. (Hint: are the potential outcomes of those who report winning more than $10,000 identical, in expectation, to those who report winning little or nothing?)
(b) Suppose the researcher were to restrict the sample to people who had played the lottery at least once during the past year. Is it now safe to assume that the potential outcomes of those who report winning more than $10,000 are identical, in expectation, to those who report winning little or nothing?

10. Suppose researchers seek to assess the effect of receiving a free newspaper subscription on students’ interest in politics. A list of student dorm rooms is drawn up and sorted randomly. Dorm rooms in the first half of the randomly sorted list receive a newspaper at their door each morning for two months; dorm rooms in the second half of the list do not receive a paper.
(a) University researchers are sometimes required to disclose to subjects that they are participating in an experiment. Suppose that prior to the experiment, researchers distributed a letter informing students in the treatment group that they would be receiving a newspaper as part of a study to see if newspapers make students more interested in politics. Explain (in words and using potential outcomes notation) how this disclosure may jeopardize the excludability assumption.
(b) Suppose that students in the treatment group carry their newspapers to the cafeteria where they may be read by others. Explain (in words and using potential outcomes notation) how this may jeopardize the non-interference assumption.

11. Several randomized experiments have assessed the effects of drivers’ training classes on the likelihood that a student will be involved in a traffic accident or receive a ticket for a moving violation.15 A complication arises because students who take drivers’ training courses typically obtain their licenses faster than students who do not take a course.16 (The reason is unknown but may reflect the fact that those who take the training are better prepared for the licensing examination.) If students in the control group on average start driving much later, the proportion of students who have an accident or receive a ticket could well turn out to be higher in the treatment group. Suppose a researcher were to compare the treatment and control group in terms of the number of accidents that occur within three years of obtaining a license.
(a) Does this measurement approach maintain symmetry between treatment and control groups?
(b) Would symmetry be maintained if the outcome measure were the number of accidents per mile of driving?
(c) Suppose researchers were to measure outcomes over a period of three years starting the moment at which students were randomly assigned to be trained or not. Would this measurement strategy maintain symmetry? Are there drawbacks to this approach?

12. A researcher studying 1,000 prison inmates noticed that prisoners who spend at least three hours per day reading are less likely to have violent encounters with prison staff. The researcher therefore recommends that all prisoners be required to spend at least three hours reading each day. Let $d_i$ be 0 when prisoners read less than three hours each day and 1 when prisoners read more than three hours each day. Let $Y_i(0)$ be each prisoner’s potential number of violent encounters with prison staff when reading less than three hours per day, and let $Y_i(1)$ be each prisoner’s potential number of violent encounters when reading more than three hours per day.
(a) In this study, nature has assigned a particular realization of $d_i$ to each subject. When assessing this study, why might one be hesitant to assume that $E[Y_i(0) \mid D_i = 0] = E[Y_i(0) \mid D_i = 1]$ and $E[Y_i(1) \mid D_i = 0] = E[Y_i(1) \mid D_i = 1]$?
(b) Suppose that researchers were to test this researcher’s hypothesis by randomly assigning 10 prisoners to a treatment group. Prisoners in this group are required to go to the prison library and read in specially designated carrels for three hours each day for one week; the other prisoners, who make up the control group, go about their usual routines. Suppose, for the sake of argument, that all prisoners in the treatment group in fact read for three hours each day and that none of the prisoners in the control group read at all during the week of the study. Critically evaluate the excludability assumption as it applies to this experiment.
(c) State the assumption of non-interference as it applies to this experiment.
(d) Suppose that the results of this experiment were to indicate that the reading treatment sharply reduces violent confrontations with prison staff. How does the non-interference assumption come into play if the aim is to evaluate the effects of a policy whereby all prisoners are required to read for three hours?

15 See Roberts and Kwan 2001.
16 Vernick et al. 1999.
CHAPTER 3
Sampling Distributions, Statistical Inference, and Hypothesis Testing
Rigorous quantification of uncertainty is a hallmark of scientific inquiry. When analyzing experimental data, the aim is not only to generate unbiased estimates of the average treatment effect but also to draw inferences about the uncertainty surrounding these estimates. Among the most attractive features of experimentation is that random allocation of treatments is a reproducible procedure. Reproducibility allows us to assess the sampling distribution, or collection of estimated ATEs that could have come about under different random assignments, in order to better understand the uncertainty associated with the experiment we conducted. One objective of this chapter is to explain how experimental design affects the sampling distribution. We consider ways of designing experiments so as to reduce sampling variability, and we call attention to the fact that the sampling distribution may change markedly depending on the procedures used to randomly allocate subjects to treatment and control conditions.

A second objective is to guide the reader through the calculation and interpretation of key statistical results. When analyzing an experiment, you should consider both the estimated ATE and the uncertainty with which it is estimated. Unless you have prior information about the value of the ATE, the experimental estimate is your best guess of the true treatment effect, but this guess may be close to or far from the true average causal effect. Statisticians commonly assess uncertainty in two ways. One method is to investigate whether the experimental results are sufficiently informative to refute a determined skeptic who insists that there is no treatment effect whatsoever. Another approach is to identify a range of values that probably bracket the true average treatment effect. This chapter introduces a flexible set of statistical techniques that may be used to assess uncertainty across a wide array of different experimental designs.
SAMPLING DISTRIBUTIONS AND STATISTICAL INFERENCE
What you will learn from this chapter:
1. How to quantify the uncertainty surrounding an experimental estimate.
2. Formulas that suggest ways to design more informative experiments.
3. How to refute a determined skeptic who advances the "null hypothesis" that the treatment has no effect.
4. How to generate confidence intervals that have a 95% chance of bracketing the true sample average treatment effect.
3.1
Sampling Distributions
One of the most important topics in experimental design and analysis is sampling variability. When examining the results of a single experiment, we must bear in mind that we have in front of us just one of the many possible datasets that could have been generated via random assignment. The experiment we happened to conduct yields an estimate of the average treatment effect, but had the same observations been randomly assigned in a different way, our estimate might have been quite different. The term sampling distribution refers to the collection of estimates that could have been generated by every possible random assignment.¹

To illustrate the idea of a sampling distribution, let's return to the village council experiment discussed in the previous chapter. In that study, two of the seven
BOX 3.1
Definition: Sampling Distribution of Experimental Estimates
A sampling distribution is the frequency distribution of a statistic obtained from hypothetical replications of a randomized experiment. For example, if one were to conduct the same experiment repeatedly under identical conditions, the collection of estimated average treatment effects from each replication of the experiment forms a sampling distribution. Under the central limit theorem, the sampling distribution of the estimated average treatment effect takes the shape of a normal distribution as the number of observations in treatment and control conditions increases.
1 The distribution of estimates from all possible randomizations is also called the randomization distribution. See Rosenbaum 1984.
villages were randomly assigned to receive the treatment, which in this case is appointment of a woman to the position of village council head. Empirically, we happen to observe one particular realization of that randomization, illustrated in Table 2.2. But prior to our random allocation, there were 21 different ways to place two of the seven villages into the treatment group, and each of these 21 allocations had the same probability of being selected. Using the schedule of potential outcomes listed in Table 2.1, we may generate the hypothetical experimental results that each of the 21 possible randomizations would have produced. In other words, for each possible randomization, we calculate the average budget allocation in the treatment and control groups, and calculate the difference in means. The results are displayed in Table 3.1.

The first thing to note about the results in Table 3.1 is that the average estimated ATE is 5, which is exactly the same as the true ATE in Table 2.1. This is no coincidence. In fact, the exercise underscores a very important feature of randomized experiments: given three core assumptions discussed in Chapter 2 (random assignment, excludability, and non-interference), the average estimated ATE across all
TABLE 3.1
Sampling distribution of estimated ATEs generated when two of the seven villages listed in Table 2.1 are assigned to treatment

Estimated ATE    Frequency with which an estimate occurs
-1               2
0                2
0.5              1
1                2
1.5              2
2.5              1
6.5              1
7.5              3
8.5              3
9                1
9.5              1
10               1
16               1

Average = 5      Total = 21
possible random assignments is equal to the true ATE. Any single experiment might give a number that is too high or too low, but the expected value of this estimation procedure is the true ATE. One of the great virtues of experiments is that they generate unbiased estimates of the ATE: merely by subtracting the control group mean from the treatment group mean, we obtain an estimator that on average recovers the true ATE.

The next thing to note about the results in Table 3.1 is that the 21 possible experiments generate quite different results. The largest estimated ATE is 16: we had a 1-in-21 chance of obtaining an estimate that was more than three times the size of the true ATE of 5. Two of the 21 randomizations produce an estimated ATE of -1. Had we obtained an estimate of -1, we might have been led to believe that women village council heads tend to reduce the share of budgets directed at water sanitation, even though the opposite is true. The dispersion of estimates around the true ATE reminds us that while experiments are unbiased, they are not necessarily precise. With just seven observations, our experiment generates results that vary markedly from one randomization to the next.
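The enumeration behind Table 3.1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' own code. The lists `Y0` and `Y1` are an assumed schedule of potential outcomes standing in for Table 2.1, which is not reproduced in this section, and `difference_in_means` is a hypothetical helper name.

```python
from itertools import combinations
from collections import Counter
from statistics import mean

# ASSUMPTION: potential outcomes for the seven villages, standing in for
# Table 2.1 (not reproduced in this section).
Y0 = [10, 15, 20, 20, 10, 15, 15]  # outcome if the village is untreated
Y1 = [15, 15, 30, 15, 20, 15, 30]  # outcome if the village is treated

def difference_in_means(treated):
    """Estimated ATE when the villages indexed by `treated` are treated."""
    treated_mean = mean(Y1[i] for i in treated)
    control_mean = mean(Y0[i] for i in range(len(Y0)) if i not in treated)
    return treated_mean - control_mean

# Every way of assigning 2 of the 7 villages to treatment: C(7, 2) = 21.
estimates = [difference_in_means(t) for t in combinations(range(7), 2)]
sampling_distribution = Counter(estimates)

# The true ATE is the average of the unit-level treatment effects.
true_ate = mean(y1 - y0 for y0, y1 in zip(Y0, Y1))

for estimate, count in sorted(sampling_distribution.items()):
    print(f"{estimate:5.1f}  {count}")
print("average estimate:", mean(estimates), "true ATE:", true_ate)
```

Under the assumed schedule, the average of the 21 estimates equals the true ATE, which is the unbiasedness property described above, while individual estimates range widely.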
3.2
The Standard Error as a Measure of Uncertainty
In order to describe the precision with which an experiment recovers the ATE, we need a statistic that characterizes the amount of sampling variability. Sampling variability is typically expressed by reference to the standard error. The larger the standard error, the more uncertainty surrounds our parameter estimate. The standard error is the standard deviation of the sampling distribution. It is obtained by calculating the squared deviation of each estimate from the average estimate, dividing by the number of possible randomizations, and taking the square root of the result. Based on the numbers in Table 3.1, we calculate the standard error as follows:

\[
\begin{aligned}
\text{Sum of squared deviations} ={}& (-1-5)^2 + (-1-5)^2 + (0-5)^2 + (0-5)^2 + (0.5-5)^2 + (1-5)^2 \\
&+ (1-5)^2 + (1.5-5)^2 + (1.5-5)^2 + (2.5-5)^2 + (6.5-5)^2 + (7.5-5)^2 \\
&+ (7.5-5)^2 + (7.5-5)^2 + (8.5-5)^2 + (8.5-5)^2 + (8.5-5)^2 + (9-5)^2 \\
&+ (9.5-5)^2 + (10-5)^2 + (16-5)^2 = 445. \\
\text{Square root of the average squared deviation} ={}& \sqrt{\tfrac{1}{21}(445)} = 4.60.
\end{aligned}
\tag{3.1}
\]
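The arithmetic in equation (3.1) can be checked directly from the frequencies in Table 3.1. The sketch below is illustrative only; the variable names are hypothetical, and the dictionary simply transcribes the table.

```python
from math import sqrt

# Table 3.1: estimated ATE -> number of random assignments that produce it.
table_3_1 = {-1: 2, 0: 2, 0.5: 1, 1: 2, 1.5: 2, 2.5: 1, 6.5: 1,
             7.5: 3, 8.5: 3, 9: 1, 9.5: 1, 10: 1, 16: 1}

n_assignments = sum(table_3_1.values())
average_estimate = sum(est * k for est, k in table_3_1.items()) / n_assignments

# Squared deviation of each estimate from the average, weighted by frequency.
sum_sq_dev = sum(k * (est - average_estimate) ** 2
                 for est, k in table_3_1.items())

# Divide by the number of randomizations (not one less) and take the root.
standard_error = sqrt(sum_sq_dev / n_assignments)

print(n_assignments, average_estimate, sum_sq_dev, round(standard_error, 2))
```

Note that the divisor is the full number of randomizations, 21, because the sampling distribution is enumerated completely rather than sampled.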
When a parameter is estimated using an unbiased estimator (such as difference-in-means), a helpful rule of thumb is that approximately 95% of the experimental outcomes fall within an interval that ranges from two standard errors below the true
BOX 3.2
Definition: Standard Deviation
The standard deviation of a variable $X_i$ is

\[ \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}, \]

where $\bar{X}$ denotes the mean of $X_i$. Notice that this formula divides by $N$. When $X_i$ is a random sample from a larger population containing $N^*$ subjects whose mean is unknown, the estimate of the population standard deviation is

\[ \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(X_i - \bar{X}\right)^2}. \]

Definition: Standard Error
The standard error is the standard deviation of a sampling distribution. Suppose there are $J$ possible ways of randomly assigning subjects. Let $\hat{\theta}_j$ represent the estimate (denoted by the "hat" mark: ^) we obtain from the $j$th randomization, and $\bar{\hat{\theta}}$ represent the average estimate for all $J$. For example, $\hat{\theta}_j$ may represent the difference-in-means estimate from one of the possible random assignments. Over all $J$ possible random assignments, the standard error of $\hat{\theta}$ is

\[ \sqrt{\frac{1}{J}\sum_{j=1}^{J}\left(\hat{\theta}_j - \bar{\hat{\theta}}\right)^2}. \]
parameter to two standard errors above it. Given a true parameter of 5 and a standard error of 4.60, this interval stretches from -4.20 to 14.20. (In Table 3.1, we see that in fact this rule of thumb works well for this example: 20 of the 21 estimates, or 95%, fall in this range. The approximation on which this rule of thumb rests tends to become more accurate as sample size increases.) At the low end of this interval, we infer that appointing a female council head diminishes budgets for water sanitation, and at the top of this interval we grossly exaggerate the extent to which women leaders allocate more money toward this policy area.

How can we reduce the standard error? In other words, how can we design our experiment so that it produces more precise estimates of the ATE? In order to answer this question, let's inspect a formula that expresses the standard error as a function of the potential outcomes and the experimental design. Before getting to the formula itself, we first define some key terms. The variance of an observed or potential outcome
BOX 3.3
Rule of Thumb for Sampling Distributions of the Estimated ATE
For randomized experiments with a given standard error, approximately 95% of all random assignments will generate estimates of the ATE that fall within ±2 standard errors of the true ATE. For example, if the standard error is 10 and the ATE is 50, approximately 95% of the estimates will be between 30 and 70. This rule of thumb works best when N is large.
for a set of $N$ subjects is the average squared deviation of each subject's value from the mean for all $N$ subjects. For example, the variance of $Y_i(1)$ is:

\[ \mathrm{Var}(Y_i(1)) = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i(1) - \frac{\sum_{i=1}^{N} Y_i(1)}{N}\right)^2. \tag{3.2} \]
The higher the variance, the greater the dispersion around the mean. The smallest possible variance is zero, which implies that the variable is a constant. Note that the variance is the square of the standard deviation. For more details on this formula, see Box 3.2.

To obtain the covariance between two variables, such as $Y_i(1)$ and $Y_i(0)$, subtract the mean from each variable, and then calculate the average cross-product of the result:

\[ \mathrm{Cov}(Y_i(0), Y_i(1)) = \frac{1}{N}\sum_{i=1}^{N}\left(Y_i(0) - \frac{\sum_{i=1}^{N} Y_i(0)}{N}\right)\left(Y_i(1) - \frac{\sum_{i=1}^{N} Y_i(1)}{N}\right). \tag{3.3} \]
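The two definitions above translate directly into code. The following sketch is illustrative: the helper names `variance` and `covariance` are hypothetical, and `Y0` and `Y1` are an assumed schedule of potential outcomes standing in for Table 2.1, which is not reproduced in this section. Both helpers divide by N, as in equations (3.2) and (3.3), because the schedule describes all N subjects rather than a sample of them.

```python
def variance(x):
    """Average squared deviation from the mean, dividing by N."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((xi - x_bar) ** 2 for xi in x) / n

def covariance(x, y):
    """Average cross-product of deviations from the means, dividing by N."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

# ASSUMPTION: potential outcomes standing in for Table 2.1.
Y0 = [10, 15, 20, 20, 10, 15, 15]
Y1 = [15, 15, 30, 15, 20, 15, 30]

var_y0 = variance(Y0)
var_y1 = variance(Y1)
cov_y0_y1 = covariance(Y0, Y1)

print(round(var_y0, 2), round(var_y1, 2), round(cov_y0_y1, 2))
```

A quick sanity check on any implementation is that the covariance of a variable with itself equals its variance.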
The covariance is a measure of association between two variables. A negative covariance implies that lower values of one variable tend to coincide with higher values of the other variable. A positive covariance means that higher values of one variable tend to coincide with higher values of the other variable.

Applying these formulas to the schedule of potential outcomes listed in Table 2.1, we find the variance of $Y_i(0)$ to be 14.29, and the variance of $Y_i(1)$ to be 42.86. The covariance of $Y_i(0)$ and $Y_i(1)$ is 7.14.

In order to obtain the standard error associated with the experimental estimate of the average treatment effect, the remaining step is to decide how many of the $N$ observations are to be treated. We will call $m$ the number of treated units. In our example, $m = 2$ and $N = 7$. In general, we require that $0 < m < N$, because if $m$ were zero or equal to $N$, we would have just one experimental group rather than two. For the same reason, we also require $N > 1$.

With all of the ingredients in place, we are now ready to write the formula for the standard error of the estimated ATE. The equation places a "hat" mark over the estimand in order to indicate that we are talking about an estimate of the ATE, not the true ATE:
BOX 3.4
Properties of Variances and Covariances
(a) $\mathrm{Cov}(Y_i(0), Y_i(0)) = \mathrm{Var}(Y_i(0)) \geq 0$
(b) $\mathrm{Cov}(Y_i(0), Y_i(1)) = \mathrm{Cov}(Y_i(0), Y_i(0) + \tau_i) = \mathrm{Var}(Y_i(0)) + \mathrm{Cov}(Y_i(0), \tau_i)$
(c) $\mathrm{Cov}(aY_i(0), bY_i(1)) = ab\,\mathrm{Cov}(Y_i(0), Y_i(1))$

Covariances are bounded by the restriction that:

\[ \mathrm{Cov}(Y_i(0), Y_i(1)) \leq \sqrt{\mathrm{Var}(Y_i(0))\,\mathrm{Var}(Y_i(1))} \leq \frac{\mathrm{Var}(Y_i(0)) + \mathrm{Var}(Y_i(1))}{2}. \]

The correlation between two variables $Y_i(0)$ and $Y_i(1)$ is

\[ \frac{\mathrm{Cov}(Y_i(0), Y_i(1))}{\sqrt{\mathrm{Var}(Y_i(0))\,\mathrm{Var}(Y_i(1))}}. \]
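\[ SE(\widehat{ATE}) = \sqrt{\frac{1}{N-1}\left\{\frac{m\,\mathrm{Var}(Y_i(0))}{N-m} + \frac{(N-m)\,\mathrm{Var}(Y_i(1))}{m} + 2\,\mathrm{Cov}(Y_i(0), Y_i(1))\right\}}. \tag{3.4} \]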
This formula tells us which factors reduce the size of the standard error.² The formula contains five inputs: $N$, $m$, $\mathrm{Var}(Y_i(0))$, $\mathrm{Var}(Y_i(1))$, and $\mathrm{Cov}(Y_i(0), Y_i(1))$. Changing each input one by one, while holding the other inputs constant, we can examine how the standard error changes. Here is a summary of the formula's implications for experimental design:

1. The larger the $N$, the smaller the standard error. Holding constant the other inputs (including $m$, the size of the treatment group) and increasing $N$ means increasing the control group. As the control group grows, the first and third terms inside the braces of equation (3.4) are diminished by an expanding $N - 1$ denominator. If the control group were infinite in size, the only source of uncertainty would come from the treatment group. Sometimes adding subjects to the control group involves little or no additional cost. For experiments in which treating subjects costs resources but leaving them untreated does not (e.g., sending mail to those in the treatment group while sending nothing to those in the control), bring as many additional subjects as possible into the control group. A similar point holds for the treatment group: holding the control group's size constant, increasing $m$, the size of the treatment group, diminishes the second and third terms in equation (3.4). Of course, where possible, increase the size of both the control group and the treatment group, as this reduces all three terms in equation (3.4).

2. The smaller the variance of $Y_i(0)$ or $Y_i(1)$, the smaller the standard error. To maximize precision, conduct experiments on observations that are as similar as possible in terms of their potential outcomes. This principle has three design implications. First, it encourages researchers to measure outcomes as accurately as possible, as this dampens variability. Second, as discussed later in this chapter, blocking may be used to improve precision by grouping observations with similar potential outcomes.³ Third, as explained in Chapter 4, another way to reduce variance in outcomes is to measure outcomes in advance of the experimental intervention, sometimes known as a pre-test, in addition to measuring outcomes after the intervention via a post-test. Instead of defining the experimental outcome to be the score on the post-test, define the experimental outcome to be the change from pre-test to post-test. Change scores usually have less variance than post-test scores.

3. Assuming that $Y_i(0)$ and $Y_i(1)$ do vary, the smaller the covariance between $Y_i(0)$ and $Y_i(1)$, the smaller the standard error.⁴ A particularly favorable case occurs when the potential outcomes have negative covariance: that is, where high values of $Y_i(0)$ tend to coincide with low values of $Y_i(1)$. In order to see how this pattern might occur, write $Y_i(1) = Y_i(0) + \tau_i$. Substituting for $Y_i(1)$ allows us to express $\mathrm{Cov}(Y_i(0), Y_i(1))$ as $\mathrm{Var}(Y_i(0)) + \mathrm{Cov}(Y_i(0), \tau_i)$. (See Box 3.4.) So, when $Y_i(0)$ and $\tau_i$ are negatively related (e.g., students with low baseline scores are most helped by the treatment), the covariance between $Y_i(0)$ and $Y_i(1)$ may be close to zero or even negative.

4. A subtle implication of the formula is that when the variances of $Y_i(0)$ and $Y_i(1)$ are similar, it is advisable to assign approximately half of the observations to the treatment group, such that $m \approx N/2$. When the potential outcomes have different variances, invest additional observations in the experimental condition with greater variance. For example, if $\mathrm{Var}(Y_i(1)) > \mathrm{Var}(Y_i(0))$, put a greater share of the observations into the treatment group. In practice, however, researchers seldom know in advance which group is likely to have more variance and therefore place equal numbers of subjects in each condition.

2 Equation (3.4) describes the true standard error, $SE(\widehat{ATE})$, which is not to be confused with the estimated standard error, $\widehat{SE}(\widehat{ATE})$, calculated based on a particular experiment.

3 The researcher could restrict attention to subjects with similar background attributes in order to reduce variance in $Y_i(0)$ and $Y_i(1)$, but this approach has the drawback of limiting the generalizations that might be drawn from this narrow set of subjects. In order to overcome this limitation, a researcher may conduct the experiment within several different blocks, each of which contains subjects with similar background attributes.

4 Here is the underlying intuition for why positive covariance leads to larger standard errors. If high values of $Y_i(1)$ tend to coincide with high values of $Y_i(0)$, then selecting a subject with a high potential outcome into the treatment group leaves one fewer subject with high potential outcomes for the control group. Positive covariance between $Y_i(0)$ and $Y_i(1)$ therefore means that results are sensitive to the placement of subjects into treatment or control. On the other hand, if high values of $Y_i(1)$ tend to coincide with low values of $Y_i(0)$, then selecting a subject with high potential outcomes into the treatment group leaves one fewer subject with low potential outcomes for the control group. In this case, there is less sampling variability because the control group is "compensated" for the fact that the treatment group received a high value of $Y_i(1)$. See section 3.6.1.

In order to see the formula at work, we fill in the values obtained from Table 2.1 and obtain exactly the same number we calculated from Table 3.1:

\[ SE(\widehat{ATE}) = \sqrt{\frac{1}{6}\left\{\frac{(2)(14.29)}{5} + \frac{(5)(42.86)}{2} + (2)(7.14)\right\}} = 4.60. \tag{3.5} \]
When we know the full schedule of potential outcomes, equation (3.4) tells us the standard deviation of estimated ATEs from all possible random assignments. Equation (3.4) can be used to demonstrate that our design of $m = 2$ is less than optimal. Increasing the number of treated units so that $m = 3$ lowers the standard error to 3.7. Raising $m$ to 4 lowers the standard error even further, to 3.3. The reason it is better to put more observations into the treatment group than the control group is that in this example the treated potential outcomes have more variance than the untreated potential outcomes: $\mathrm{Var}(Y_i(1)) > \mathrm{Var}(Y_i(0))$. Raising $m$ to 5 goes too far in that direction and leads to a slight deterioration in precision.

To summarize, standard errors are measures of uncertainty; they indicate the extent to which estimates will vary across all possible random assignments. One consideration when designing experiments is to keep standard errors small. Two types of inputs determine the standard error: the schedule of potential outcomes and the number of observations assigned to treatment and control groups. The schedule of potential outcomes is unobserved, but experimenters sometimes have opportunities to design experiments in ways that limit the mischief that highly variable potential outcomes may cause. Variance in $Y_i(0)$ and $Y_i(1)$ may be reduced by measuring outcomes in a more precise manner or by blocking observations into homogeneous subsets. Standard errors are also reduced by "Robin Hood treatments" that raise outcomes among those with low $Y_i(0)$ while lowering outcomes among those with high $Y_i(0)$. (For example, a regimented physical fitness program in a military academy, where cadets have little spare time, might raise the level of fitness among the least fit while lowering it among top-level athletes.) If the cost per subject is similar in both experimental groups, assign similar numbers to treatment and control, tilting the balance in favor of the group that is expected to have more variable outcomes.
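The comparison of designs described above can be sketched as a small function. The function name `true_standard_error` is hypothetical. The sketch assumes equation (3.4) takes the form of a square root of 1/(N - 1) times the sum of m Var(Y(0))/(N - m), (N - m) Var(Y(1))/m, and twice the covariance, which is the form that reproduces the 4.60 reported in equation (3.5). The variances and covariance are the rounded values quoted in the text.

```python
from math import sqrt

def true_standard_error(n, m, var_y0, var_y1, cov_y0_y1):
    """Standard error of the difference-in-means estimator for n subjects,
    m of them treated (assumed form of equation 3.4)."""
    if n <= 1 or not 0 < m < n:
        raise ValueError("require N > 1 and 0 < m < N")
    inside_braces = (m * var_y0 / (n - m)
                     + (n - m) * var_y1 / m
                     + 2 * cov_y0_y1)
    return sqrt(inside_braces / (n - 1))

# Rounded inputs quoted in the text for the village council example.
VAR_Y0, VAR_Y1, COV = 14.29, 42.86, 7.14

# Standard error for each feasible number of treated villages.
se_by_m = {m: true_standard_error(7, m, VAR_Y0, VAR_Y1, COV)
           for m in range(1, 7)}

for m, se in se_by_m.items():
    print(m, round(se, 2))
```

Looping over m in this way is a cheap planning exercise whenever a researcher is willing to posit plausible values for the variances and the covariance, though in practice those inputs are guesses rather than known quantities.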
3.3
Estimating Sampling Variability
The previous section illustrated the concept of sampling variability by showing how standard errors can be calculated from a known schedule of potential outcomes. This exercise is important because it suggests ways in which problems of statistical uncertainty can be addressed through experimental design. Now that we have an appreciation for what a standard error is, we take up the question that arises whenever one estimates an ATE for a particular set of subjects.⁵ Suppose the researcher wants to know the standard error in order to calibrate the uncertainty associated with this estimate. The researcher has neither the complete schedule of potential outcomes nor results from all of the hypothetical random assignments that could have allocated this set of observations to treatment and control groups. Instead, this researcher has results from a single randomization.

Assessing uncertainty is an estimation problem. The true standard error is unknown. We seek to estimate this unknown quantity using data from a single experiment. What hints can a single experiment provide about how other experimental assignments might have come out? Returning to equation (3.4), we see that the experiment provides information about four of the five inputs that generate standard errors. The number of observations allocated to treatment and control is known. The variance of the untreated potential outcomes $Y_i(0)$ can be estimated using the observed outcomes in the control group. Because the control group is assigned at random, the variance that we observe in the control group is an unbiased estimate of the variance in $Y_i(0)$. The same approach can be used to estimate the variance of $Y_i(1)$ based on the observed variance in the assigned treatment group. The one element in this equation that cannot be estimated empirically is the covariance between $Y_i(0)$ and $Y_i(1)$, as we never observe both potential outcomes for the same subject. The standard approach is to use a conservative estimation formula that is at least as large as equation (3.4) regardless of the covariance between $Y_i(0)$ and $Y_i(1)$.⁶ The conservative formula assumes that the treatment effect is the same for all subjects, which implies that the correlation between $Y_i(0)$ and $Y_i(1)$ is 1.0.⁷
5 The average treatment effect is the ATE for the "finite population" of subjects in one's experiment. If the experimental subjects are seen as a sample from a larger population, one may distinguish between the ATE in the sample and the ATE in the population. In Chapter 11, we take up the question of how to use the sample at hand to draw inferences about the ATE in the population.

6 To be more precise, the conservative estimation approach tends to overestimate the true sampling variance, which is the square of the standard error depicted in equation (3.4). One note of caution: although the conservative estimation formula overestimates the true sampling variance on average, estimation is subject to sampling variability. A given estimate of the sampling variance using the conservative formula may still be smaller than the true sampling variance.

7 See Samii and Aronow 2012, pp. 366-367 for proof that equation (3.6), when squared, is a conservative estimator of equation (3.4), when squared. The conservative formula will give unbiased estimates of the sampling variance under either of two scenarios. The first arises when the treatment effect $\tau_i$ is the same for all subjects. The second arises when subjects are sampled at random from a large population prior to random assignment, and the objective is to estimate the population average treatment effect. When subjects are sampled randomly from a large population, the selection of one subject for the treatment group has no material effect on the pool of available subjects that can be selected into the control group, which renders the covariance between $Y_i(0)$ and $Y_i(1)$ irrelevant.
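The formula for estimating the standard error of the average treatment effect is

\[ \widehat{SE} = \sqrt{\frac{\widehat{\mathrm{Var}}(Y_i(0))}{N-m} + \frac{\widehat{\mathrm{Var}}(Y_i(1))}{m}}, \tag{3.6} \]

where the variances are estimated using the $m$ observations of potential outcomes from the treatment group:

\[ \widehat{\mathrm{Var}}(Y_i(1)) = \frac{1}{m-1}\sum_{i \in \text{treatment}}\left(Y_i(1) - \frac{\sum_{i \in \text{treatment}} Y_i(1)}{m}\right)^2, \tag{3.7} \]

and the $N - m$ observations of potential outcomes from the control group:

\[ \widehat{\mathrm{Var}}(Y_i(0)) = \frac{1}{N-m-1}\sum_{i \in \text{control}}\left(Y_i(0) - \frac{\sum_{i \in \text{control}} Y_i(0)}{N-m}\right)^2. \tag{3.8} \]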
Note that when calculating the sample variances in equations (3.7) and (3.8), we divide by one less than the number of observations to take into account the fact that one observation is expended when we calculate the sample mean.⁸ In order to avoid dividing by zero when estimating the standard error, we must have at least two subjects in each experimental group.

How closely do the empirical estimates⁹ of the standard errors match the true standard errors? Using the conservative formula, the average standard error in our example is 4.65, which is not far from the true standard error of 4.60. Although the estimates vary from one random allocation to the next, on average, the estimator depicted in equation (3.6) does a reasonably good job of estimating the true level of sampling variability.
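The estimation procedure described in this section can be sketched as follows. The function name `estimated_standard_error` is hypothetical, and the sketch assumes the conservative estimator is the square root of the control-group sample variance divided by N - m plus the treatment-group sample variance divided by m. The observed outcomes and the schedule of potential outcomes are assumed values standing in for Tables 2.2 and 2.1, which are not reproduced in this section.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance  # statistics.variance divides by n - 1

def estimated_standard_error(treated_outcomes, control_outcomes):
    """Conservative standard error estimate from one experiment."""
    m, n_control = len(treated_outcomes), len(control_outcomes)
    if m < 2 or n_control < 2:
        raise ValueError("need at least two subjects in each group")
    return sqrt(variance(control_outcomes) / n_control
                + variance(treated_outcomes) / m)

# ASSUMPTION: observed outcomes for the realized assignment (Table 2.2).
treated = [15, 30]
control = [15, 20, 20, 10, 15]
observed_se = estimated_standard_error(treated, control)

# ASSUMPTION: full schedule of potential outcomes (Table 2.1), used only to
# see how the estimator behaves across all 21 possible assignments.
Y0 = [10, 15, 20, 20, 10, 15, 15]
Y1 = [15, 15, 30, 15, 20, 15, 30]
all_ses = []
for assignment in combinations(range(7), 2):
    treated_y = [Y1[i] for i in assignment]
    control_y = [Y0[i] for i in range(7) if i not in assignment]
    all_ses.append(estimated_standard_error(treated_y, control_y))
average_se = mean(all_ses)

print(round(observed_se, 3), round(average_se, 2))
```

The second half of the sketch is possible only in a teaching example where the full schedule of potential outcomes is known; a researcher with a single experiment can compute only the first quantity.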
3.4
Hypothesis Testing
The previous section illustrated the challenges that arise when estimating the sampling uncertainty surrounding a single experiment's estimate of the ATE. Accurate estimation of the standard error requires accurate guesses about the variances and covariance of potential outcomes. If we do not have much data about the variances or if our simplifying assumption about the unobservable covariance is mistaken, our estimated standard errors may be inaccurate.
8 If we know the mean and the outcomes for all but one of the subjects, we can deduce the outcome for the remaining subject. In statistical parlance, calculating the mean expends one degree of freedom.

9 The 21 estimated standard errors are {1.581, 1.871, 1.871, 1.871, 1.871, 1.871, 2.236, 2.784, 2.958, 3.122, 3.122, 5.244, 5.339, 7.599, 7.665, 7.730, 7.730, 7.730, 7.730, 7.826, 7.826}.
Much easier is the task of testing the sharp null hypothesis that the treatment effect is zero for all observations. What makes it easy is that if the null hypothesis is true, $Y_i(0) = Y_i(1)$. Under this special case, we observe both potential outcomes for every observation. We may therefore take the observed outcomes in our dataset and simulate all possible randomizations as though we were working from a complete schedule of potential outcomes. These simulated randomizations provide an exact sampling distribution of the estimated average treatment effect under the sharp null hypothesis.

By looking at this distribution of hypothetical outcomes, we can calculate the probability of obtaining an estimated ATE at least as large as the one we obtained from our actual experiment if the treatment effect were in fact zero for every subject. For example, the randomization depicted in Table 2.2 generated an estimate of the ATE of 6.5. How likely are we to obtain an estimate as large as or larger than 6.5 if the true effect were zero for all observations? The probability, or p-value, of interest in this case addresses a one-tailed hypothesis, namely that female village council heads increase budget allocations to water sanitation. If we sought to evaluate the two-tailed hypothesis (whether female village council heads either increase or decrease the budget allocation for water sanitation), we would calculate the p-value of obtaining a number that is greater than or equal to 6.5 or less than or equal to -6.5.

Based on the observed outcomes in Table 2.2, we may calculate the 21 possible estimates of the ATE that could have been generated if the null hypothesis were true: {-7.5, -7.5, -7.5, -4.0, -4.0, -4.0, -4.0, -4.0, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, 3.0, 3.0, 6.5, 6.5, 6.5, 10.0, 10.0}. Five of the estimates are as large as 6.5. So when evaluating the one-tailed hypothesis that female village heads increase water sanitation budgets, we would conclude that the probability of obtaining an estimate as large as 6.5 if the null hypothesis were true is 5/21 = 24%. A two-tailed hypothesis test would count all instances in which the estimates are at least as great as 6.5 in absolute value. Eight of the estimates qualify, so the two-tailed p-value is 8/21 = 38%.

In theory, this type of calculation could be performed on experiments of any size, but in practice the number of possible random assignments becomes astronomical as $N$ increases. For example, an experiment where $N = 50$ and half the observations are assigned to the treatment group potentially generates more than 126 trillion randomizations:

\[ \frac{50!}{25!\,25!} = 126{,}410{,}606{,}437{,}752. \tag{3.9} \]

BOX 3.5
Definition: Sharp Null Hypothesis of No Effect
The treatment effect is zero for all subjects. Formally, $Y_i(1) = Y_i(0)$ for all $i$.
Definition: Null Hypothesis of No Average Effect
The average treatment effect is zero. Formally, $\mu_{Y(1)} = \mu_{Y(0)}$.
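The enumeration described above can be sketched as follows. The function names are hypothetical, and the observed outcomes and treatment assignment are assumed values standing in for Table 2.2, which is not reproduced in this section. The last function is not an exact test: it samples assignments at random rather than enumerating them, which is one option when the count in equation (3.9) makes full enumeration impractical.

```python
import random
from itertools import combinations
from math import comb

def difference_in_means(outcomes, treated):
    """Treatment-group mean minus control-group mean."""
    treated = set(treated)
    treated_y = [y for i, y in enumerate(outcomes) if i in treated]
    control_y = [y for i, y in enumerate(outcomes) if i not in treated]
    return sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)

def exact_p_values(outcomes, treated):
    """Enumerate every assignment under the sharp null of no effect."""
    observed = difference_in_means(outcomes, treated)
    null_estimates = [difference_in_means(outcomes, t)
                      for t in combinations(range(len(outcomes)), len(treated))]
    one_tailed = sum(e >= observed for e in null_estimates) / len(null_estimates)
    two_tailed = (sum(abs(e) >= abs(observed) for e in null_estimates)
                  / len(null_estimates))
    return observed, one_tailed, two_tailed

def sampled_two_tailed_p_value(outcomes, treated, draws=20_000, seed=1):
    """Approximate the two-tailed p-value from randomly drawn assignments."""
    rng = random.Random(seed)
    observed = difference_in_means(outcomes, treated)
    hits = 0
    for _ in range(draws):
        draw = rng.sample(range(len(outcomes)), len(treated))
        if abs(difference_in_means(outcomes, draw)) >= abs(observed):
            hits += 1
    return hits / draws

# ASSUMPTION: observed outcomes and treated villages (Table 2.2).
outcomes = [15, 15, 20, 20, 10, 15, 30]
treated = (0, 6)

observed, p_one, p_two = exact_p_values(outcomes, treated)
p_two_sampled = sampled_two_tailed_p_value(outcomes, treated)
n_assignments_50 = comb(50, 25)  # number of assignments in equation (3.9)

print(observed, round(p_one, 2), round(p_two, 2), round(p_two_sampled, 2))
print(n_assignments_50)
```

Because the sharp null fixes every subject's outcome regardless of assignment, the same observed outcomes are reused for each hypothetical assignment; nothing beyond the data and the assignment procedure is needed.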
When the number of possible randomizations is large, one can closely approximate the sampling distribution by sampling at random from the set of all possible random assignments. Whether one uses all possible randomizations or a large sample of them, the calculation of p-values based on an inventory of possible randomizations is called randomization inference.

This approach to hypothesis testing has two attractive properties. First, the procedures used to calculate p-values may be applied to a very broad class of hypotheses and applications. The method is not confined to large samples or normally distributed outcomes. Any sample size will do, and the method can be applied to all sorts of outcomes, including counts, durations, or ranks. Second, the method is exact in the sense that the set of all possible random assignments fully describes the sampling distribution under the null hypothesis. By contrast, the hypothesis testing methods discussed in introductory statistics courses rely on an approximation in order to derive the shape of the sampling distribution. (See Box 3.7.) Although exact and approximate methods will tend to give very similar answers when samples are large, we use randomization inference throughout the book so that a single statistical approach may be applied to a broad array of applications without approximations or additional assumptions.

Since the 1930s, it has been conventional to dub p-values that are below 0.05 as statistically significant on the grounds that, under the null hypothesis, the researcher had less than a 1-in-20 chance of obtaining the result due to chance. The p-values in the village council experiment were 0.24 and 0.38, which fail to meet this standard. By the 0.05 standard of statistical significance, the estimate of 6.5 does not provide a convincing refutation of the null hypothesis of no effect. The 0.05 standard is a matter of convention, not statistical theory, but it is so deeply entrenched that researchers should be prepared to indicate whether their experimental results are significant at the 0.05 level.

Anticipating this concern, researchers often attempt to forecast the probability that their experiment, when conducted, will lead to the rejection of the null hypothesis at a given significance level, such as 0.05. This probability is termed the statistical power of the experiment. To say that an experimental design has 80% power, for instance, means that 80% of all possible random assignments will produce observed results that will lead to the rejection of the null hypothesis in the presence of the posited treatment effect. Forecasting the power of an experiment requires some guesswork. The appendix to this chapter illustrates how assumptions are used when calculating the power of a proposed experiment.

Be careful not to confuse statistical significance with substantive significance. A parameter estimate that falls short of the 0.05 threshold might nevertheless be
SAMPLING DISTRIBUTIONS AND STATISTICAL INFERENCE
BOX 3.6

Definition: One-tailed and Two-tailed Hypothesis Tests
A null hypothesis that specifies that the treatment effect is zero can be rejected by test statistics that are either very large or very small. This is called a two-tailed test. For example, an intervention that is believed to change outcomes (either positively or negatively) would be evaluated using a two-tailed test. A null hypothesis that specifies that the effect is zero or less can be rejected by a large positive test statistic. (Similarly, a null hypothesis that specifies that an effect is zero or more is rejected by a large negative test statistic.) This is called a one-tailed test. Therapeutic interventions are advanced in anticipation of finding positive effects, in which case the null hypothesis is that the effect is zero or less, and an appropriate test is one-tailed.

Definition: p-value
For a two-tailed test, the p-value is the probability of obtaining a test statistic at least as large in absolute value as the observed test statistic, given that the null hypothesis is true. For example, suppose the null hypothesis is that the treatment has no effect, and the test statistic is the estimated ATE. If the estimated ATE is 5 and the p-value is 0.20, there is a 20% chance of obtaining an estimate as large as 5 (or as small as −5) simply by chance. If the alternative hypothesis is a positive (or negative) effect, a one-tailed test is used.

Randomization Inference
The sampling distribution of the test statistic under the null hypothesis is computed by simulating all possible random assignments. When the number of random assignments is too large to simulate, the sampling distribution may be approximated by a large random sample of possible assignments. p-values are calculated by comparing the observed test statistic to the distribution of test statistics under the null hypothesis.
important and interesting. If female council heads in fact caused a 6.5 percentage-point increase in budgetary allocations to water sanitation, the health consequences of increasing the role of women in local government throughout the Indian countryside could be profound. Although statistical uncertainty remains, the data have taught us something potentially useful. If this were the first experiment of its kind and we had no prior knowledge of the treatment effect, the estimate of 6.5 would still be our best guess of the true ATE, despite the fact that we cannot rule out the
BOX 3.7

Comparing Randomization Inference to T-Tests
Approximate methods for testing the sharp null hypothesis of no effect assume that the sampling distribution of the difference-in-means estimator has a particular shape. For example, the t-test, which should be familiar to those who have taken an introductory statistics course, assumes that the sampling distribution of the estimated difference-in-means follows a t-distribution, which is similar to a normal distribution. The t-test gives accurate p-values when outcomes in each of the experimental groups are distributed normally. When outcomes are distributed non-normally, approximate methods become increasingly accurate as the number of subjects in each of the experimental conditions grows.

To illustrate the difference between a t-test and randomization inference, we generated the hypothetical dataset shown below. The treatment and control groups each contain 10 subjects. The treatment is an encouragement to make a charitable donation, and the outcome is the amount of money contributed. This outcome is skewed to the right due to a few large donations. The treatment group average is 80, while the control group average is 10. The sampling distribution under the sharp null hypothesis is bimodal (due to the influence of a few large donations), but conventional methods assume that the sampling distribution is bell-shaped. Repeating the random assignment 100,000 times under the sharp null hypothesis of no effect, we find that the observed average treatment effect of 70 has a one-tailed p-value of 0.032. A t-test assuming equal variances puts this p-value at 0.082; a t-test allowing for unequal variances declares the p-value to be 0.091. The t-test is inaccurate in this case because the outcomes are skewed and the number of observations is small.
Readers are encouraged to tinker with this example using the simulation code at http://isps.research.yale.edu/FEDAI to see how the shape of the sampling distribution changes as more subjects are added to the dataset.

Treatment   Donation     Treatment   Donation     Treatment   Donation
    1          500           1           0            0          10
    1          100           1           0            0           5
    1          100           1           0            0           5
    1           50           0          25            0           5
    1           25           0          20            0           0
    1           25           0          15            0           0
    1            0           0          15
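The comparison in Box 3.7 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' simulation code: it enumerates every one of the 184,756 possible assignments rather than sampling 100,000 of them, and the function name `exact_one_tailed_p` is our own. Because group sizes are fixed, the difference in means is "at least as large as observed" exactly when the treated-group total is at least as large as the observed total, so the sketch compares integer totals.

```python
from itertools import combinations

# Hypothetical donations from Box 3.7, split by experimental group.
treated = [500, 100, 100, 50, 25, 25, 0, 0, 0, 0]
control = [25, 20, 15, 15, 10, 5, 5, 5, 0, 0]


def exact_one_tailed_p(treated, control):
    """One-tailed p-value for the difference in means under the sharp null.

    Under the sharp null of no effect, every subject's outcome is the same
    whichever group the subject lands in, so we can recompute the test
    statistic for every possible way of choosing the treatment group.
    """
    outcomes = treated + control
    m = len(treated)
    observed_total = sum(treated)
    n_assignments = 0
    n_as_large = 0
    # combinations() works by position, so tied outcome values are still
    # treated as distinct subjects.
    for hypothetical_treated in combinations(outcomes, m):
        n_assignments += 1
        if sum(hypothetical_treated) >= observed_total:
            n_as_large += 1
    return n_as_large / n_assignments, n_assignments


p_value, n_assignments = exact_one_tailed_p(treated, control)
print(f"{n_assignments} possible assignments, one-tailed p = {p_value:.3f}")
```

The result should land close to the simulated value of 0.032 reported in the box; a two-tailed version would instead count assignments whose difference in means is at least as large in absolute value.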
hypothesis that the ATE is zero. The right way to think about an experimental result that is substantively significant but statistically insignificant is that it warrants further investigation. As we conduct further experiments, our uncertainty will gradually diminish, and we will be able to make a clearer determination about the true value of the ATE. (See Chapter 11 for a discussion of how sampling variability diminishes when pooling a series of repeated experiments.) Conversely, don't be overly impressed with statistically significant results without reflecting on their substantive significance. All else equal, the standard error shrinks in inverse proportion to the square root of N. Imagine the number of observations were to grow from 7 to 70,000 villages. This 10,000-fold increase in N would imply a 100-fold decrease in standard error, from 4.6 to 0.046 percentage points. If this enormous experiment were to indicate that female village council heads increased sanitation budgets by an average of 0.1 percentage points, the finding would be statistically significant at the 0.05 level but substantively trivial.

One final caution about testing the null hypothesis that Y_i(0) = Y_i(1): this equality represents the sharp null hypothesis of no treatment effect for any observation, which should not be confused with the null hypothesis of no average effect. Suppose the treatment effect were 5 for half of the observations and −5 for the other half. In this scenario, the average effect would be zero, but the sharp null hypothesis would be false. Both null hypotheses have the testable implication that the average outcome in the treatment group will be similar to the average outcome in the control group, but the sharp null hypothesis has the further testable implication that the two groups will have similarly shaped distributions.
In order to conduct a more exacting evaluation of the sharp null hypothesis, we may consider statistics other than the estimated ATE. Because the sharp null hypothesis provides us with the complete schedule of potential outcomes, we can simulate the sampling distribution of any test statistic. For example, we could compare the variances in the treatment and control groups. If treatment effects vary and are uncorrelated with or positively correlated with Y_i(0), the observed variance of Y_i(1) will tend to be larger than the variance of Y_i(0). The sampling distribution under the sharp null hypothesis indicates the probability of observing a difference in variances as large as what we in fact observe in our sample. We simply calculate all possible differences and compute the p-value of the difference we obtained from our experiment. We return to these and other diagnostic tests of heterogeneous treatment effects in Chapter 9.
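As a rough sketch of the idea in the preceding paragraph, the Python below applies randomization inference to a difference-in-variances statistic. The data are invented for illustration (a control group clustered near 10 and a treatment group whose outcomes look like the same baseline shifted by +5 or −5), and the helper names `diff_in_variances` and `ri_p_value` are our own. The sketch samples random assignments instead of enumerating them all.

```python
import random
from statistics import pvariance


def diff_in_variances(treated, control):
    """Test statistic: variance in treatment minus variance in control."""
    return pvariance(treated) - pvariance(control)


def ri_p_value(treated, control, statistic, n_sims=5000, seed=0):
    """One-tailed randomization-inference p-value under the sharp null.

    The observed outcomes are reshuffled across groups of the original
    sizes; the p-value is the share of reshuffles whose statistic is at
    least as large as the one actually observed.
    """
    rng = random.Random(seed)
    observed = statistic(treated, control)
    pooled = list(treated) + list(control)
    m = len(treated)
    n_as_large = 0
    for _ in range(n_sims):
        rng.shuffle(pooled)
        if statistic(pooled[:m], pooled[m:]) >= observed:
            n_as_large += 1
    return n_as_large / n_sims


# Invented outcomes: both groups average 10, so the estimated ATE is zero,
# yet the treatment group is far more dispersed.
control = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
treated = [15, 6, 14, 5, 17, 3, 15, 6, 14, 5]

p_variances = ri_p_value(treated, control, diff_in_variances)
print(f"difference in variances: {diff_in_variances(treated, control):.1f}")
print(f"one-tailed p-value: {p_variances:.4f}")
```

In this made-up example the difference in means gives no hint of a treatment effect, while the variance comparison casts strong doubt on the sharp null; any other statistic could be passed in place of `diff_in_variances`.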
3.5 Confidence Intervals
When faced with a decision, policy makers may turn to an experiment for guidance about the average effect of an intervention. They typically want to know how big the average treatment effect is, not whether the effect is statistically distinguishable from
zero. Testing the null hypothesis is beside the point; their principal objective is to use the results to form a guess about the value of the ATE. Interval estimation is a statistical procedure that uses the data to generate a probability statement about the range of values within which a parameter is located. For example, suppose a researcher conducts an experiment in an attempt to learn the value of the ATE. Using a formula that will be described momentarily, the researcher generates a 95% confidence interval of (2, 10). This interval has a 0.95 probability of bracketing the true ATE. In other words, if we imagine a series of hypothetical replications of this experiment under identical conditions, 95 out of 100 random assignments will generate intervals that bracket the true ATE. When interpreting confidence intervals, remember that the location of the interval varies from one experiment to the next, while the true ATE remains constant.

By convention, social scientists construct 95% confidence intervals, but forming a 95% confidence interval based on the results of a single experiment requires some guesswork. Recall that a single experiment reveals the Y_i(1) outcomes for the treated subjects and Y_i(0) outcomes for the untreated subjects. We do not observe the full schedule of potential outcomes. Without the full schedule of potential outcomes, we cannot simulate the sampling distribution of the estimated ATE. The best we can do is approximate the full schedule of potential outcomes by making an educated guess about the unobserved Y_i(0) outcomes for the treated subjects and Y_i(1) outcomes for the untreated subjects. The most straightforward method for filling in missing potential outcomes is to assume that the treatment effect τ_i is the same for all subjects.
For subjects in the control condition, missing Y_i(1) values are imputed by adding the estimated ATE to the observed values of Y_i(0). Similarly, for subjects in the treatment condition, missing Y_i(0) values are imputed by subtracting the estimated ATE from the observed values of Y_i(1). This approach yields a complete schedule of potential outcomes, which we may then use to simulate all possible random allocations. In order to form a 95% confidence interval, we list the estimated ATE from each random allocation in ascending order. The estimate at the 2.5th percentile marks the bottom of the interval, and the estimate at the 97.5th percentile marks the top of the interval.10 When the treatment and control groups contain an equal number of subjects (m = N − m), this method of forming confidence intervals will tend to be conservative, in the sense that the interval typically will be wider than the interval we
10 A more complex and computationally intensive approach is to "invert" the hypothesis test (Rosenbaum 2002). This method involves hypothesizing an ATE called ATE*, subtracting it from the observed outcomes in the treatment group to approximate Y_i(0), and testing the null hypothesis that the average (adjusted) Y_i(0) in the treatment group is equal to the average Y_i(0) in the control group. Progressively larger values of ATE* are tried until one locates a value that has a p-value of 0.025 and another that has a p-value of 0.975; these values mark the range of the 95% confidence interval. This method tends to be more accurate than the simpler method we describe, but both tend to produce similar results, especially in large samples.
would obtain if we actually observed Y_i(0) and Y_i(1) for all subjects.11 When the number of subjects in treatment and control differ, this method may produce confidence intervals that are too wide or too narrow, depending on whether Var(Y_i(0)) is larger or smaller than Var(Y_i(1)). The method is conservative when the control group is larger and Var(Y_i(0)) > Var(Y_i(1)) or when the treatment group is larger and Var(Y_i(1)) > Var(Y_i(0)). When working with treatment and control groups of markedly different sizes, one should use this method for estimating confidence intervals with caution if outcomes in the smaller group are substantially more variable. A rule of thumb is that caution is required if the smaller group is less than half the size of the larger group, and the smaller group's standard deviation is at least twice as large as the larger group's standard deviation. One reason to assign equal numbers of subjects to treatment and control is that this type of "balanced" experimental design facilitates interval estimation by eliminating these complications.

To illustrate interval estimation, we analyze data from Clingingsmith, Khwaja, and Kremer's study of Pakistani Muslims who participated in a lottery to obtain a visa for the pilgrimage to Mecca.12 By comparing lottery winners to lottery losers, the authors are able to estimate the effects of the pilgrimage13 on the social, religious, and political views of the participants. Here, we consider the effect of winning the visa lottery on attitudes toward people from other countries. Winners and losers were asked to rate the Saudi, Indonesian, Turkish, African, European, and Chinese people on a five-point scale ranging from very negative (−2) to very positive (+2). Adding the responses to all six items creates an index ranging from −12 to +12.
11 In small samples, the distance between the estimated ATE and the top (or bottom) of the estimated confidence interval should be widened by a factor of \sqrt{(N - 1)/(N - 2)} to adjust for the fact that the ATE used to fill in the full schedule of potential outcomes is itself estimated from the data. For example, suppose a study of ten subjects were to produce an estimated interval ranging from 8 to 12. This correction would widen the interval by a factor of 1.061, resulting in an adjusted interval that ranges from 7.88 to 12.12. See Samii and Aronow 2011. As N increases, this correction becomes negligible.
12 Clingingsmith, Khwaja, and Kremer 2009.
13 Our description sidesteps the complications that arise due to the fact that 14% of the lottery losers nonetheless made the pilgrimage and 1% of the lottery winners failed to make the pilgrimage. Anticipating the discussion of noncompliance in Chapters 5 and 6, we estimate the intent-to-treat effect: the effect of winning the lottery. In order to simplify the presentation, we sidestep the problem of clustering (multiple individuals applying for the lottery together) by randomly selecting one person from each cluster. For ease of presentation, we also ignore the slight differences in treatment probabilities across blocks defined by the size and location of the parties applying for a visa. Controlling for blocks has negligible effects on the results.

TABLE 3.2
Pakistani Muslims' ratings of peoples from foreign countries by success in the visa lottery

Ratings of people             Distribution of responses
from other countries       Control (%)     Treatment (%)
        -12                    0.00             0.20
         -9                    0.22             0.00
         -8                    0.00             0.20
         -6                    0.45             0.20
         -5                    0.00             0.20
         -4                    0.45             0.59
         -3                    0.00             0.20
         -2                    1.12             0.98
         -1                    1.56             2.75
          0                   27.23            18.63
          1                   18.30            13.14
          2                   24.33            25.29
          3                    8.48            10.98
          4                    5.80             9.61
          5                    3.35             3.92
          6                    3.79             7.25
          7                    2.23             2.55
          8                    0.89             1.37
          9                    0.22             0.78
         10                    0.45             0.00
         11                    0.67             0.20
         12                    0.45             0.98
      Total (N)              100 (448)        100 (510)

Source: Clingingsmith, Khwaja, and Kremer 2009.

The distribution of responses in the treatment group (N = 510) and control group (N = 448) is presented in Table 3.2. The average in the treatment group is 2.34, as compared to 1.87 in the control group. The estimated ATE is therefore 0.47. The estimated standard deviations in the treatment and control groups are 2.63 and 2.41, respectively. In this study, the treatment and control conditions contain similar numbers of subjects, and the smaller of the two groups does not have more variable outcomes, so our method for computing confidence intervals is expected to be reasonably accurate. In order to estimate our 95% interval, we must form a complete schedule of potential outcomes. We add 0.47 to the observed Y_i(0) outcomes in the control group in order to approximate the control group's unobserved Y_i(1) values; we subtract 0.47 from the treatment group's observed Y_i(1) outcomes in order to
approximate the treatment group's unobserved Y_i(0) values. Simulating 100,000 random allocations using this schedule of potential outcomes and sorting the estimated ATEs in ascending order, we find that the 2,500th estimate is 0.16 and the 97,501st estimate is 0.79, so the 95% interval is [0.16, 0.79].

The statistical interpretation of this interval is as follows: over hypothetical replications of this experiment, intervals created in this manner have a 95% chance of bracketing the true ATE. So without other information about the true ATE, we conclude that there is a 95% probability that the interval from 0.16 to 0.79 includes the true ATE.

Substantively, we infer that winning the visa lottery led to an increase in positive feelings toward people from foreign countries. Unfortunately, given the data at hand, there is no easy way to translate the estimated ATE of 0.47 into other metrics that have more tangible meaning in terms of societal outcomes or individual behavior. For example, we do not know how positive responses on this scale translate into cooperative diplomatic relations, increased international trade, or friendly behavior toward visitors from these countries. And because the outcome measure is specific to this study, we are unable to compare the 0.47 effect of this intervention to the effect of other interventions. This gap in our understanding sets the stage for further research. Now that this visa lottery has revealed a causal effect, the next steps are to conduct further visa lottery studies using other outcome measures and to measure the effects of other types of interventions using this survey metric.

When presented with experimental results that are not scaled in relation to interpretable outcome metrics or the effects of other interventions, researchers often fall back on the calculation of standardized effect size.
This approach compares the estimated effect to the naturally occurring degree of variation in outcomes by dividing the estimated ATE by the standard deviation in the control group.14 Using this formula, the apparent ATE in this study moves people by about one-fifth of a standard deviation. Again, we confront a problem of interpretation. Is a 0.2 movement in standard deviation big or small? Researchers sometimes invoke rules of thumb: effects of less than 0.3 are considered small, between 0.3 and 0.8 are considered medium, and above 0.8 are considered large.15 One should be cautious about applying these standards for three reasons. First, the standard deviation is a sample-specific statistic; if one's experimental subjects happen to be Pakistanis who share a similar view of foreigners prior to the intervention, the standard deviation will be small, and the standardized effect will seem large. Second, the standard deviation tends to increase when outcomes are measured with error, as is often the case with survey measures of attitudes. Third, even small standardized effects can be substantively important if they alter a hard-to-move dependent variable. The standard deviation of men's height
14 This standardized statistic is known as Glass's Δ, from Glass 1976.
15 Cohen 1988.
is about 2.8 inches, but a dietary supplement that causes a half-inch increase in height would be heralded as remarkable. In much the same way, an intervention that produces a change in attitudes toward foreigners is noteworthy given the difficulty of changing attitudes in this domain.
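The interval-estimation procedure described in this section can be sketched in Python as follows. Several things here are assumptions rather than part of the original analysis: the respondent counts were back-calculated by us from the percentages and group sizes in Table 3.2, the function names are our own, and the sketch draws 2,000 random allocations (the text uses 100,000) so that it runs quickly. It imputes the missing potential outcomes under a constant treatment effect, re-randomizes, and reads off the 2.5th and 97.5th percentiles; it also computes the standardized effect size (Glass's Δ).

```python
import random
from statistics import stdev

# Respondents per rating, back-calculated from Table 3.2 (an assumption:
# each count is the percentage times the group size, rounded).
control_counts = {-9: 1, -6: 2, -4: 2, -2: 5, -1: 7, 0: 122, 1: 82, 2: 109,
                  3: 38, 4: 26, 5: 15, 6: 17, 7: 10, 8: 4, 9: 1, 10: 2,
                  11: 3, 12: 2}
treated_counts = {-12: 1, -8: 1, -6: 1, -5: 1, -4: 3, -3: 1, -2: 5, -1: 14,
                  0: 95, 1: 67, 2: 129, 3: 56, 4: 49, 5: 20, 6: 37, 7: 13,
                  8: 7, 9: 4, 11: 1, 12: 5}


def expand(counts):
    """Turn {rating: count} into one list entry per respondent."""
    return [rating for rating, n in counts.items() for _ in range(n)]


def mean(values):
    return sum(values) / len(values)


def constant_effect_ci(treated, control, n_sims=2000, seed=0):
    """95% interval from re-randomizing an imputed schedule of outcomes."""
    rng = random.Random(seed)
    ate_hat = mean(treated) - mean(control)
    # Imputed Y(0) for everyone; Y(1) is Y(0) + ate_hat by assumption.
    y0 = list(control) + [y - ate_hat for y in treated]
    n, m = len(y0), len(treated)
    total_y0 = sum(y0)
    estimates = []
    for _ in range(n_sims):
        treated_y0_sum = sum(rng.sample(y0, m))
        treated_mean = treated_y0_sum / m + ate_hat
        control_mean = (total_y0 - treated_y0_sum) / (n - m)
        estimates.append(treated_mean - control_mean)
    estimates.sort()
    # Mirrors "the 2,500th and 97,501st of 100,000 sorted estimates".
    lower = estimates[max(n_sims * 25 // 1000 - 1, 0)]
    upper = estimates[min(n_sims * 975 // 1000, n_sims - 1)]
    return ate_hat, lower, upper


treated = expand(treated_counts)
control = expand(control_counts)
ate_hat, lower, upper = constant_effect_ci(treated, control)
glass_delta = ate_hat / stdev(control)
print(f"ATE = {ate_hat:.2f}, 95% interval = [{lower:.2f}, {upper:.2f}]")
print(f"Glass's delta = {glass_delta:.2f}")
```

With more simulated allocations the endpoints should settle near the [0.16, 0.79] interval reported above; small discrepancies would reflect simulation noise and the rounding involved in reconstructing counts from percentages.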
3.6 Sampling Distributions for Experiments That Use Block or Cluster Random Assignment
The concepts and estimation techniques presented in previous sections may be adapted to experiments in which subjects are assigned randomly but in ways that depart from simple or complete random assignment. In this section, we discuss two such classes of experimental designs: block randomization and cluster randomization.
3.6.1 Block Random Assignment
Block random assignment refers to a procedure whereby subjects are partitioned into subgroups (called blocks or strata), and complete random assignment occurs within each block. For example, suppose we have 20 subjects in our experiment, 10 men and 10 women. Suppose our experimental design calls for 10 subjects to be placed into the treatment condition. If we were to use complete random assignment, chances are that we would end up with unequal numbers of men and women in the treatment group. Block randomization, on the other hand, ensures equal numbers of men and women will be assigned to each experimental condition. First, we partition the subject pool into men and women. From the pool of male subjects, we randomly assign five into the treatment group; from the pool of female subjects, we randomly assign five into the treatment group. In effect, block randomization creates a series of miniature experiments, one per block.

Block randomized designs are used to address a variety of practical and statistical concerns. Sometimes program requirements dictate how many subjects of each type to place in the treatment group. Imagine, for example, that a summer reading program aimed at elementary school students seeks to evaluate its impact on school performance and retention during the following academic year. The school is able to admit only a small fraction of those who apply, and school administrators worry that if too many children with low levels of preparedness are admitted to the program, teachers will find it difficult to manage their classes effectively. These administrators insist that 60% of the children admitted to the program pass an initial test of basic skills.
The way to address this concern is by blocking on initial test scores and allocating students within each block so that 60% of the students who are randomly admitted to the program have passed the basic skills test. For example, suppose the school
BOX 3.8

Number of Possible Assignments under Complete Random Assignment and Blocked Random Assignment
Let 0 < m < N. Of N observations, m are placed into the treatment group, and N − m are placed into the control group. The number of possible randomizations under complete random assignment is

$$\frac{N!}{m!\,(N - m)!}.$$

For example, the number of randomizations under complete random assignment when N = 20 and m = 10 is

$$\frac{20!}{10!\,10!} = 184{,}756.$$

In order to calculate the number of possible random assignments under a blocked design with B blocks, calculate the number of random allocations in each block: r_1, r_2, ..., r_B. The total number of random assignments is r_1 × r_2 × ... × r_B. For example, when we randomly assign half of the 10 men to treatment and half of the 10 women to treatment, there are

$$\left(\frac{10!}{5!\,5!}\right)\left(\frac{10!}{5!\,5!}\right) = 63{,}504$$

possible random allocations.
can admit 50 of the 100 applicants. Forty of the applicants failed the initial test, and 60 passed. The researcher could create two blocks: one block of students who passed the basic skills test and another block of students who failed. Each block is sorted in random order, and the researcher selects the first 20 students in the block containing those who failed the basic skills test and the first 30 students in the block containing students who passed the test. This procedure ensures that the 60% requirement is satisfied. This design approach comes in handy when resource constraints prevent researchers from treating more than a certain number of subjects from certain regions or when concerns about fairness dictate that treatments be apportioned equally across demographic groups.

Block randomization also addresses two important statistical concerns. First, blocking helps reduce sampling variability. Sometimes the researcher is able to partition the subjects into blocks such that the subjects in each block have similar potential outcomes. For example, students who fail the basic skills test presumably share similar potential outcomes; the same goes for students who pass the basic skills test.
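The admission procedure just described (sort each block in random order, then take the first 20 or 30 students) can be sketched in Python as follows. The student identifiers and the function name `block_assign` are hypothetical, introduced only for illustration. The last lines also reproduce the counts of possible assignments from Box 3.8 using the standard library's `math.comb`.

```python
import math
import random


def block_assign(blocks, n_treated, seed=None):
    """Complete random assignment carried out separately within each block.

    blocks    -- dict mapping block name to a list of subject ids
    n_treated -- dict mapping block name to the number of subjects to treat
    Returns a dict mapping subject id to 1 (treated) or 0 (control).
    """
    rng = random.Random(seed)
    assignment = {}
    for name, subjects in blocks.items():
        shuffled = list(subjects)
        rng.shuffle(shuffled)  # sort the block in random order
        for position, subject in enumerate(shuffled):
            # the first n_treated[name] subjects in the block are treated
            assignment[subject] = 1 if position < n_treated[name] else 0
    return assignment


# Hypothetical applicant ids: 40 who failed the skills test, 60 who passed.
blocks = {
    "failed": [f"F{i}" for i in range(40)],
    "passed": [f"P{i}" for i in range(60)],
}
assignment = block_assign(blocks, {"failed": 20, "passed": 30}, seed=1)

admitted = [s for s, z in assignment.items() if z == 1]
share_passed = sum(s.startswith("P") for s in admitted) / len(admitted)
print(f"admitted: {len(admitted)}, share who passed the test: {share_passed}")

# Counting possible assignments as in Box 3.8 (10 men and 10 women).
complete = math.comb(20, 10)
blocked = math.comb(10, 5) * math.comb(10, 5)
print(complete, blocked)
```

Whatever seed is used, exactly 30 of the 50 admitted students come from the block that passed the test, which is the sense in which blocking guarantees the 60% requirement rather than merely making it likely.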
By randomizing within each block, the researcher eliminates the possibility of rogue randomizations that, by chance, place all of the students who fail the basic skills test into the treatment group. Under simple or complete random assignment, these outlandish assignments rarely occur; under block random assignment, they are ruled out entirely. Second, blocking ensures that certain subgroups are available for separate analysis. When analyzing a study involving 10 men and 10 women, a researcher might be interested in comparing the ATE among men to the ATE among women. But what if complete random assignment puts 9 of the 10 men into the treatment group and 9 of the 10 women in the control group? In that case, the treatment effects among men and among women would both be estimated very imprecisely. Blocked randomization guarantees that a specified proportion of a subgroup will be assigned to treatment.

In order to illustrate how blocking works, let's consider a stylized example inspired by Olken's study of corruption in Indonesia.16 The subjects in this experiment are public works projects, and the treatment is heightened financial oversight by government officials. Outcomes are measured in terms of the amount of money that is unaccounted for (and presumably stolen) when the books are closed on the project. For purposes of illustration, we present in Table 3.3 the complete schedule of potential outcomes for 14 projects, 8 of which are in Region A and 6 in Region B. Because of resource constraints, each region has the capacity to audit only two of its projects. In our example, the ATE is −3 in Region A and −5 in Region B. For both regions combined, the ATE is (−3)(8/14) + (−5)(6/14) = −3.9. In general, the relationship between the overall ATE and the ATE within each block j is:

$$ATE = \sum_{j=1}^{J} \frac{N_j}{N}\, ATE_j, \tag{3.10}$$

where J is the number of blocks, the blocks are indexed by j, and the weight N_j/N denotes the share of all subjects who belong to block j.

Before studying the statistical precision of the blocked design, it is useful to consider as a point of comparison the precision of a design that uses complete random assignment. Suppose treatments had been assigned to any 4 of the 14 projects through complete random assignment. Equation (3.4) indicates that the true standard error would have been:

$$SE(\widehat{ATE}) = \sqrt{\frac{1}{13}\left\{\frac{(4)(40.41)}{10} + \frac{(10)(32.49)}{4} + (2)(31.03)\right\}} = 3.50.$$

16 Olken 2007.

In order to calculate the standard error from a blocked design, we must first calculate the standard error within each block. In our stylized example, the projects within each region have similar potential outcomes. As the variances of Y_i(0) and Y_i(1) are
TABLE 3.3
Schedule of potential outcomes for public works projects when audited (Y(1)) and not audited (Y(0))

                        All subjects     Block A subjects   Block B subjects
Village      Block      Y(0)     Y(1)     Y(0)     Y(1)      Y(0)     Y(1)
   1           A          0        0        0        0
   2           A          1        0        1        0
   3           A          2        1        2        1
   4           A          4        2        4        2
   5           A          4        0        4        0
   6           A          6        0        6        0
   7           A          6        2        6        2
   8           A          9        3        9        3
   9           B         14       12                          14       12
  10           B         15        9                          15        9
  11           B         16        8                          16        8
  12           B         16       15                          16       15
  13           B         17        5                          17        5
  14           B         18       17                          18       17
Mean                    9.14     5.29     4.00     1.00      16.0     11.0
Variance               40.41    32.49     7.75     1.25      1.67     17.0
Cov(Y(0), Y(1))            31.03              2.13               1.00
much lower within each region than they are when the two regions are combined, the standard error drops markedly when we analyze each region separately. For Region A, the standard error is 1.23; for Region B, it is 2.71. The remaining task is to use these block-specific standard errors to assess the uncertainty of the estimated ATE for subjects in both regions combined. The formula turns out to be straightforward and easily extends to any number of blocks.17 For two blocks, the standard error is:
$$SE(\widehat{ATE}) = \sqrt{(SE_1)^2\left(\frac{N_1}{N}\right)^2 + (SE_2)^2\left(\frac{N_2}{N}\right)^2}, \tag{3.12}$$
17 The formula for any number of blocks is

$$SE(\widehat{ATE}) = \sqrt{\sum_{j=1}^{J} \left(\frac{N_j}{N}\right)^2 SE^2(\widehat{ATE}_j)}.$$

This formula follows from the general rule about the variance of a sum of independent random variables: Var(αA + βB) = α²Var(A) + β²Var(B).
where SE_j refers to the standard error of the estimated ATE in block j, and N_j refers to the number of observations in block j. Filling in the numbers from our example gives a standard error of:

$$SE(\widehat{ATE}) = \sqrt{(1.23)^2\left(\frac{8}{14}\right)^2 + (2.71)^2\left(\frac{6}{14}\right)^2} = 1.36. \tag{3.13}$$
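The arithmetic behind Equations (3.10), (3.12), and (3.13) can be checked with a short Python sketch built on the schedule of potential outcomes in Table 3.3. Two caveats: the function names are ours, and the standard-error formula inside `true_se` is our reading of the expression the chapter cites as Equation (3.4), inferred from the worked numbers above, so it should be checked against the original. As a cross-check that does not depend on that formula, `enumerated_se` lists every possible assignment and takes the standard deviation of the resulting estimates.

```python
from itertools import combinations
from math import sqrt

# (Y(0), Y(1)) for each project in Table 3.3.
block_a = [(0, 0), (1, 0), (2, 1), (4, 2), (4, 0), (6, 0), (6, 2), (9, 3)]
block_b = [(14, 12), (15, 9), (16, 8), (16, 15), (17, 5), (18, 17)]


def mean(values):
    return sum(values) / len(values)


def true_ate(units):
    return mean([y1 - y0 for y0, y1 in units])


def true_se(units, m):
    """True SE of the difference in means when m of the units are treated.

    Variances and the covariance divide by N, as in Table 3.3.
    """
    n = len(units)
    mu0 = mean([y0 for y0, _ in units])
    mu1 = mean([y1 for _, y1 in units])
    var0 = mean([(y0 - mu0) ** 2 for y0, _ in units])
    var1 = mean([(y1 - mu1) ** 2 for _, y1 in units])
    cov = mean([(y0 - mu0) * (y1 - mu1) for y0, y1 in units])
    inner = m * var0 / (n - m) + (n - m) * var1 / m + 2 * cov
    return sqrt(inner / (n - 1))


def enumerated_se(units, m):
    """SD of the estimated ATE across every possible assignment."""
    n = len(units)
    estimates = []
    for chosen in combinations(range(n), m):
        treated = set(chosen)
        y1_mean = mean([units[i][1] for i in treated])
        y0_mean = mean([units[i][0] for i in range(n) if i not in treated])
        estimates.append(y1_mean - y0_mean)
    center = mean(estimates)
    return sqrt(mean([(e - center) ** 2 for e in estimates]))


all_units = block_a + block_b
n = len(all_units)
w_a, w_b = len(block_a) / n, len(block_b) / n

overall_ate = w_a * true_ate(block_a) + w_b * true_ate(block_b)   # Eq. (3.10)
se_complete = true_se(all_units, 4)          # 4 of 14 treated, no blocking
se_a, se_b = true_se(block_a, 2), true_se(block_b, 2)
se_blocked = sqrt(w_a ** 2 * se_a ** 2 + w_b ** 2 * se_b ** 2)    # Eq. (3.12)

print(f"overall ATE: {overall_ate:.2f}")
print(f"SE, complete randomization: {se_complete:.2f}")
print(f"SE by block: {se_a:.2f} (A), {se_b:.2f} (B); blocked: {se_blocked:.2f}")
```

Collecting the enumerated estimates instead of summarizing them would also allow one to tabulate how often the estimate has the wrong sign under each design.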
The example illustrates the potential benefits of blocking. By making a small design change, we greatly improve the precision with which we estimate the ATE. The standard error plummets from 3.50 to 1.36. The stark difference in sampling distributions is illustrated in Figure 3.1. Under complete random assignment, 141/1001 = 14.1% of the estimated ATEs are greater than zero, which means there is a 14.1% chance that the experiment will indicate that audits are ineffective or actually exacerbate pilfering even though (as we know from Table 3.3) the treatment is effective. Under blocked assignment, just 1/420 = 0.2% of the estimated ATEs are greater than zero.

Let's now consider how one would go about analyzing a block randomized experiment, such as the one reported in Table 3.4, which shows the observed outcomes from a single experiment based on the schedule of potential outcomes from Table 3.3. Estimating the overall ATE is straightforward: first estimate the ATE within
FIGURE 3.1
Sampling distribution under complete randomization (above); sampling distribution under blocked randomization (below)