Quantitative data analysis: doing social research to test ideas [1 ed.] 0470380039, 9780470380031


English Pages 448 Year 2009


QUANTITATIVE DATA ANALYSIS: Doing Social Research to Test Ideas

DONALD J. TREIMAN


Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.

Published by Jossey-Bass
989 Market Street, San Francisco, CA 94103, www.josseybass.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted under the United States Copyright Act, without either the prior written permission of the publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600, or on the Web at www.copyright.com. Requests to the publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, 201-748-6011, fax 201-748-6008, or online at www.wiley.com/go/permissions.

Readers should be aware that Internet Web sites offered as citations or sources for further information may have changed or disappeared between the time this was written and when it is read.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Jossey-Bass books and products are available through most bookstores. To contact Jossey-Bass directly call our Customer Care Department within the United States at (800) 956-7739, outside the United States at (317) 572-3986, or via fax at (317) 572-4002.

Jossey-Bass also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data

Treiman, Donald J.
Quantitative data analysis : doing social research to test ideas / Donald J. Treiman.
p. cm.
1. Social sciences-Research-Statistical methods. 2. Sociology-Research-Statistical methods. 3. Sociology-Statistical methods-Computer programs. 4. Social sciences-Statistical methods-Computer programs. 5. Stata. I. Title.
HA29.T675 2008
300.72-dc22

Printed in the United States of America
FIRST EDITION

PB Printing 10 9 8 7 6 5 4 3 2 1

20080131:v

CONTENTS

Tables, Figures, Exhibits, and Boxes
Preface
The Author
Introduction

1 CROSS-TABULATIONS
What This Chapter Is About
Introduction to the Book via a Concrete Example
Cross-Tabulations
What This Chapter Has Shown

2 MORE ON TABLES
What This Chapter Is About
The Logic of Elaboration
Suppressor Variables
Additive and Interaction Effects
Direct Standardization
A Final Note on Statistical Controls Versus Experiments
What This Chapter Has Shown

3 STILL MORE ON TABLES
What This Chapter Is About
Reorganizing Tables to Extract New Information
When to Percentage a Table "Backwards"
Cross-Tabulations in Which the Dependent Variable Is Represented by a Mean
Index of Dissimilarity
Writing About Cross-Tabulations
What This Chapter Has Shown

4 ON THE MANIPULATION OF DATA BY COMPUTER
What This Chapter Is About
Introduction
How Data Files Are Organized
Transforming Data
What This Chapter Has Shown
Appendix 4.A Doing Analysis Using Stata
Tips on Doing Analysis Using Stata
Some Particularly Useful Stata 10.0 Commands

5 INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
What This Chapter Is About
Introduction
Quantifying the Size of a Relationship: Regression Analysis
Assessing the Strength of a Relationship: Correlation Analysis
The Relationship Between Correlation and Regression Coefficients
Factors Affecting the Size of Correlation (and Regression) Coefficients
Correlation Ratios
What This Chapter Has Shown

6 INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
What This Chapter Is About
Introduction
A Worked Example: The Determinants of Literacy in China
Dummy Variables
A Strategy for Comparisons Across Groups
A Bayesian Alternative for Comparing Models
Independent Validation
What This Chapter Has Shown

7 MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS
What This Chapter Is About
Nonlinear Transformations
Testing the Equality of Coefficients
Trend Analysis: Testing the Assumption of Linearity
Linear Splines
Expressing Coefficients as Deviations from the Grand Mean (Multiple Classification Analysis)
Other Ways of Representing Dummy Variables
Decomposing the Difference Between Two Means
What This Chapter Has Shown

8 MULTIPLE IMPUTATION OF MISSING DATA
What This Chapter Is About
Introduction
A Worked Example: The Effect of Cultural Capital on Educational Attainment in Russia
What This Chapter Has Shown

9 SAMPLE DESIGN AND SURVEY ESTIMATION
What This Chapter Is About
Survey Samples
Conclusion
What This Chapter Has Shown

10 REGRESSION DIAGNOSTICS
What This Chapter Is About
Introduction
A Worked Example: Societal Differences in Status Attainment
Robust Regression
Bootstrapping and Standard Errors
What This Chapter Has Shown

11 SCALE CONSTRUCTION
What This Chapter Is About
Introduction
Validity
Reliability
Scale Construction
Errors-in-Variables Regression
What This Chapter Has Shown

12 LOG-LINEAR ANALYSIS
What This Chapter Is About
Introduction
Choosing a Preferred Model
Parsimonious Models
A Bibliographic Note
What This Chapter Has Shown
Appendix 12.A Derivation of the Effect Parameters
Appendix 12.B Introduction to Maximum Likelihood Estimation
Mean of a Normal Distribution
Log-Linear Parameters

13 BINOMIAL LOGISTIC REGRESSION
What This Chapter Is About
Introduction
Relation to Log-Linear Analysis
A Worked Logistic Regression Example: Predicting Prevalence of Armed Threats
A Second Worked Example: Schooling Progression Ratios in Japan
A Third Worked Example (Discrete-Time Hazard-Rate Models): Age at First Marriage
A Fourth Worked Example (Case-Control Models): Who Was Appointed to a Nomenklatura Position in Russia?
What This Chapter Has Shown
Appendix 13.A Some Algebra for Logs and Exponents
Appendix 13.B Introduction to Probit Analysis

14 MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION
What This Chapter Is About
Multinomial Logit Analysis
Ordinal Logistic Regression
Tobit Regression (and Allied Procedures) for Censored Dependent Variables
Other Models for the Analysis of Limited Dependent Variables
What This Chapter Has Shown

15 IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING
What This Chapter Is About
Introduction
Fixed Effects Models for Continuous Variables
Random Effects Models for Continuous Variables
A Worked Example: The Determinants of Income in China
Fixed Effects Models for Binary Outcomes
A Bibliographic Note
What This Chapter Has Shown

16 FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
What This Chapter Is About
Research Design Issues
The Importance of Probability Sampling
A Final Note: Good Professional Practice
What This Chapter Has Shown

Appendix A: Data Descriptions and Download Locations for the Data Used in This Book
Appendix B: Survey Estimation with the General Social Survey
References
Index

TABLES, FIGURES, EXHIBITS, AND BOXES

TABLES

1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.
1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.
1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.
1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.
1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.
1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).
2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
2.7. Percentage Accepting the Scientific View of Evolution by Age.
2.8. Percentage Distribution of Educational Attainment by Religion.
2.9. Percentage Distribution of Age by Religion.
2.10. Joint Probability Distribution of Education and Age.
2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Sex (Percentage Bases in Parentheses).
2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
2.14. Mean Number of Chinese Characters Known (Out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in Distribution of Education; N = 6,081).
3.1. Frequency Distribution of Acceptance of Abortion by Religion and Education, U.S. Adults, 1965 (N = 1,368).
3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.4. Means and Standard Deviations of Income in 1979 by Education and Gender, U.S. Adults, 1980.
3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.6. Percentage Distribution Over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).
5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.
6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).
6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.
6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).
7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.
7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).
7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).
7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F[...] = 12.48; p [...]

[...] Association Membership.
Frequency Distribution of Occupation by Father's Occupation, Chinese Adults, 1996.
Interaction Parameters for the Saturated Model Applied to Table 12.9.
Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School, 1996.
13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
13.2. Goodness-of-Fit Statistics for Various Models Predicting the Prevalence of Armed Threat to U.S. Adults, 1973 to 1994.
13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
13.5. Effect Parameters for Model 3 of Table 13.4.
13.6. Odds Ratios for a Model Predicting the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables.
13.7. Coefficients for a Model of Determinants of Nomenklatura Membership, Russia, 1988.
Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p Values in Italic.)
14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998 (N = 2,443).
14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.
14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults, 2000.
14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Are Significant at .001 or Beyond.)
15.1. Socioeconomic Characteristics of Chinese Adults by Size of Place of Residence, 1996.
15.2. Comparison of OLS and FE Estimates for a Model of the Determinants of Family Income, Chinese RMB, 1996 (N = 5,342).
15.3. Comparison of OLS and FE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)

FIGURES

2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.
2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.
2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.
2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
4.1. An IBM Punch Card.
5.1. Scatter Plot of Years of Schooling by Father's Years of Schooling (Hypothetical Data, N = 10).
5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling.
5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined.
5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) Perfect Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation, a Parabola Symmetrical to the X-Axis.
5.5. The Effect of a Single Deviant Case (High Leverage Point).
5.6. Truncating Distributions Reduces Correlations.
5.7. The Effect of Aggregation on Correlations.
6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). (Note: the female line does not extend beyond 16 because there are no females in the sample with post-graduate education.)
6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).
7.1. The Relationship Between 2003 Income and Age, U.S. Adults Age Twenty to Sixty-Four in 2004 (N = 1,573).
7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7; N = 1,459).
7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464).
7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).
7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). Predicted Values from a Linear Spline with a Knot at 1947.
7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086).
7.10. Figure 7.9 Rescaled to Show the Entire Range of the Y-Axis.
10.1. Four Scatter Plots with Identical Lines.
10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).
10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).
10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989).
10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D.
10.7. Added-Variable Plots for Treiman and Yip's Equation 7.
10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.
10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.
10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries.
11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556).
13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994.
13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994.
13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552).
14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258).
16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability).
16.2. Blau and Duncan's Basic Model of the Process of Stratification.

EXHIBITS

4.1. Illustration of How Data Files Are Organized.
4.2. A Codebook Corresponding to Exhibit 4.1.

BOXES

Stata -do- Files and -log- Files
Open-Ended Questions
Samuel A. Stouffer
Technical Points on Table 1.1
Technical Points on Table 1.2
Technical Points on Table 1.3
Technical Points on Table 1.4
Technical Points on Table 1.5
Technical Points on Table 1.6
Paul Lazarsfeld
Hans Zeisel
Direct Standardization in Earlier Survey Research
The Weakness of Matching and a Useful Fix
Technical Points on Table 3.3
Substantive Points on Table 3.3
A Historical Note on Social Science Computer Packages
Herman Hollerith
The Way Things Were
Treating Missing Values as If They Were Not
People Generally Like to Respond to (Well-Designed and Well-Administered) Surveys
Why Use the "Least Squares" Criterion to Determine the Best-Fitting Line?
Karl Pearson
A Useful Computational Formula for r
A "Real Data" Example of the Effect of Truncating the Distribution
A Useful Computational Formula for η²
Multicollinearity
Reminder Regarding the Variance of Dichotomous Variables
A Formula for Computing R² from Correlations
Adjusted R²
Always Present Descriptive Statistics
Technical Point on Table 6.2
Why You Should Include the Entire Sample in Your Analysis
Getting p-values via Stata
Using Stata to Compare the Goodness-of-Fit of Regression Models
R. A. (Ronald Aylmer) Fisher
How to Test the Significance of the Difference Between Two Coefficients
Alternative Ways to Estimate BIC
Why the Relationship Between Income and Age Is Curvilinear
A Trick to Reduce Collinearity
In Some Years of the GSS, Only a Subset of Respondents Was Asked Certain Questions
An Alternative Specification of Spline Functions
Why Black versus Non-black Is Better Than White versus Non-white for Social Analysis in the United States
A Comment on Credit in Science
Why Pairwise Deletion Should Be Avoided
Technical Details on the Variables
Telephone Surveys
Mail Surveys
Web Surveys
Philip M. Hauser
A Superior Sampling Procedure
Sources of Nonresponse
Leslie Kish
How the Chinese Stratified Sample Used in the Design Experiments Was Constructed
Weighting Data in Stata
Limitations of the Stata 10.0 Survey Estimation Procedure
An Alternative to Survey Estimation
How to Downweight Sample Size in Stata
Ways to Assess Reliability
Why the SAT and GRE Tests Include Several Hundred Items
Transforming Variables so That "High" Has a Consistent Meaning
Constructing Scales from Incomplete Information
In Log-Linear Analysis "Interaction" Simply Means "Association"
L² Defined
Other Software for Estimating Log-Linear Models
Maximum Likelihood Estimation
Probit Analysis
Technical Point on Table 13.1
Limitations of Wald Tests
Smoothing Distributions
Estimating Generalized Ordered Logit Models with Stata
James Tobin
Panel Surveys in the Public Domain
Otis Dudley Duncan
Sewell Wright
Ask a Foreigner to Do It
George Peter Murdock
In the United States, Publicly Funded Studies Must Be Made Available to the Research Community
An "Available from Author" Archive

PREFACE

This is a book about how to conduct theoretically informed quantitative social research, that is, social research to test ideas. It derives from a course for graduate students in sociology and other social sciences and social science-based professional schools (public health, education, social welfare, urban planning, and so on) that I have been teaching at UCLA for some thirty years. The course has evolved as quantitative methods in the social sciences have advanced; early versions of the course were based on the first half of this book (through Chapter Seven), with additional materials added over the years. Interestingly, I have been able to retain the same format (a twenty-week course with one three-hour lecture per week and a weekly exercise, culminating in a term paper written during the last four weeks of the course) from the outset, which is, I suppose, a tribute to the increasing level of preparation and quantitative competence of graduate students in the social sciences. The book owes much to lively class discussions over the years of subtle and complex methodological points.

By the end of the book, you should know how to make substantive sense of a body of quantitative data. That is, you should be well prepared to produce publishable papers in your field, as well as first-rate dissertation chapters. Of course, there is always more to learn. In the final chapter (Chapter Sixteen), I discuss advanced topics that go beyond what can be covered in a first course in data analysis.

The focus is on the analysis of data from representative samples of well-defined populations, although some exceptions are considered. The populations can consist of almost anything: people, formal organizations, societies, occupations, pottery shards, or whatever; the analytic issues are essentially the same. Data collection procedures are mentioned only in passing. There simply is not enough space in an already lengthy book to do justice to both data analysis and data collection. Thus, you will need to look elsewhere for systematic instruction on data-collection procedures. A strong case can be made that you should do this after rather than before a course on data analysis because the main problem in designing a data collection effort is deciding what to collect, which means you first need to know how you will conduct your analysis. An alternative method of learning about the practical details of data collection is to become an apprentice (unpaid, if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step-by-step even when your presence is a nuisance.

This book covers a variety of techniques, including tabular analysis, log-linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, factor-based and other techniques for scale construction, and fixed- and random-effects models as a way to make causal inferences. But this is not a statistics book; the emphasis is on using these procedures to draw substantive conclusions about how the social world works. Accordingly, the book is designed for a course to be taken after a first-year graduate statistics course in the social sciences. Although there are many equations in the book, this is because it is necessary to understand how statistical procedures work to use them intelligently. Because the emphasis is on applications, there are many worked examples, often adapted from my own research. In addition to data from sample surveys I have conducted, I also rely heavily on the General Social Survey, an omnibus survey designed for use by the research community and also for teaching. Appendix A describes the main data sets used for the substantive examples and provides information on how to obtain them; they are all available without cost.

The only prerequisites for successful use of this book are a prior graduate-level social science statistics course, a willingness to think carefully and work hard, and the ability to do high school algebra, either remembered or relearned. With only a handful of exceptions (references at one or two points to calculus and to matrix algebra), no mathematics beyond high school algebra is used. If your high school algebra is rusty, you can find good reviews in Helen Walker, Mathematics Essential for Elementary Statistics, and W. L. Bashaw, Mathematics for Statistics. These books have been around forever. Although more recent equivalents probably exist, school algebra has not changed, so it hardly matters. Copies of these books are readily available at amazon.com, and probably many other places as well.

The statistical software package used in this book is Stata (release 10). Downloadable command files (-do- files in Stata's terminology), files of results (-log- files), and ancillary computer files used in the computations are available at www.josseybass.com/go/quantitativedataanalysis. Often the details underlying particular computations are only found in the downloadable -do- and -log- files, so be sure to download and study them carefully. These files will be updated as new releases of Stata become available.

I use Stata in my teaching and in this book because it has very rapidly become the statistical package of choice in leading sociology and economics departments. This is not accidental. Stata is a fast and efficient package that includes most of the statistical procedures of interest to social scientists, and new commands are being added at a rapid pace. Although many statistical packages are available, the three leading contenders currently are Stata, SPSS, and SAS. As software, Stata is clearly superior to SPSS: it is faster, more accurate, and includes a wider range of applications. SAS, although very powerful, is not nearly as intuitive as Stata and is more difficult to learn (and to teach). Nonetheless, this book can be readily used in conjunction with either SPSS or SAS, simply by translating the syntax of the Stata -do- files. (I have done something like this, exploiting Allison's excellent, but SAS-based, exposition of fixed- and random-effects models [Allison 2005] by writing the corresponding Stata code.)

FORINSTRUCTORS Somenotes on how I have usedthesematerials in teaching may be helpful to you as you designyour own course. As noted previously,the courseon which this book is basedruns for two quarters (twenty weeks). I have offered one three-hour lecture per week and have assignedan exerciseeveryweek.When I fust taughtthe course,I readtheseexercisesmyself,but as

Preface


enrollments have increased, I have enjoyed the services of a T.A. (chosen from among students who had done well in the course in previous years), who assists students with the [...] of computing and statistics and also reads and comments on the exercises. In recent years, I have offered seventeen lectures and have assigned exercises for all but the [...], with the final month of the course devoted to producing two drafts of a term paper. [...] session I read the first drafts and write comments, in an attempt to emulate the journal submission process. Thus, in my course, everyone gets a "revise and resubmit" response. I encourage students to develop their term papers in the course of doing the exercises and to complete their drafts in the two weeks after the last exercise is due.

The initial exercises are designed to lead students in a guided way through the mechanics of analysis, and some of the later exercises do this as well. But the exercises increasingly take a free form: "carry out an analysis like that presented in the book." Illustrative answers are provided for those exercises that involve definitive answers, that is, are similar to statistics problem sets.

The course syllabus, weekly exercises, and illustrative answers to those exercises for which I have written illustrative answers are available for downloading from www.josseybass.com/go/quantitativedataanalysis

ACKNOWLEDGMENTS

As noted earlier, this book has been developed in interaction with many cohorts of graduate students at UCLA who have wrestled with each of the chapters included here and have revealed troubles in the exposition, sometimes by way of explicit comments and sometimes via displays of confusion. The book would not exist without them, as I never imagined myself writing a textbook, and so I owe them great thanks. One in particular, Pamela Stoddard, literally caused the book to be published in its current form by mentioning in the course of a chance airplane conversation with Andrew Pasternack, a Jossey-Bass acquisitions editor, that her professor was thinking of publishing the chapters he used as a course text. Andy contacted me, and the rest is history.

The course on which this book is based first came into being through collaboration with my colleague Jonathan Kelley, when he was a visiting professor at UCLA in the [...]. The first exercise is borrowed from him, and the general thrust of the course, especially the first half, owes much to him.

My colleague, Bill Mason, recently retired from the UCLA Sociology and Statistics departments, has been my statistical guru for many years. Often I have turned to him for insights into difficult statistical issues. And much that I have learned about topics that were not part of the curriculum when I was a graduate student has been from sitting in on advanced statistics courses offered by Bill. Another colleague, Rob Mare, has been helpful in much the same way. My new colleague, Jennie Brand, who took over my quantitative data analysis course in the fall of 2008, has read the entire manuscript and has offered [...] helpful suggestions. Finally, the book has benefited greatly from very careful readings by a group of about 100 Chinese students, to whom I gave a special version of the course in an intensive summer session at Beijing University in July 2008. They caught


many errors that had gone unnoticed and raised often subtle points that resulted in the reworking of selected portions of the text.

My understanding of research design and statistical issues, especially concerning causality and threats to causal inference, has benefited greatly from the weekly seminar of the California Center for Population Research, which brings together sociologists, economists, and other social scientists to listen to, and comment on, presentations of work in progress, mainly by visitors from other campuses. The lively and wide-ranging discussion has been something of a floating tutorial, a realization of what I have imagined academic life could and should be like.

Finally, my wife, Judith Herschman, has displayed endless patience, only occasionally asking, "When are you going to finally publish your methods book?"

THE AUTHOR

Donald J. Treiman is distinguished professor of sociology at the University of California at Los Angeles (UCLA) and was until recently director of UCLA's California Center for Population Research. He has a BA from Reed College (1962) and an MA and PhD from the University of Chicago (1967). As a graduate student at Chicago, he spent most of his time at the National Opinion Research Center (NORC), where he gained valuable training and experience in survey research. He then taught at the University of Wisconsin, where he decided that he really was a social demographer at heart, and made the Center for Demography and Ecology his intellectual home. From Wisconsin, he moved to Columbia University and then, in 1975, to UCLA, where he has been ever since, albeit with extended sojourns elsewhere, as staff director of a study committee at the National Academy of Sciences/National Research Council (1978-1981) and fellowship years at the U.S. Bureau of the Census (1987-1988), the Center for Advanced Study in the Behavioral Sciences (1992-1993), and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (1996-1997).

Professor Treiman started his career as a student of social stratification and status attainment, particularly from a cross-national perspective, and this has remained a continuing interest. He and his Dutch colleague, Harry Ganzeboom, have been engaged in a long-term cross-national project to analyze variations in the status attainment process across nations throughout the world over the course of the twentieth century. To date, they have compiled an archive of more than 300 sample surveys from more than 50 nations, ranging through the last half of the century. In addition to his comparative project, Professor Treiman has conducted large-scale national probability sample surveys in South Africa (1991-1994), Eastern Europe (1993-1994), and China (1996), all concerned with various aspects of social inequality.

His current research has moved in a more demographic direction. He has a national probability sample survey currently in progress in China, which focuses on the determinants, dynamics, and consequences of internal migration.

INTRODUCTION

It is not uncommon for statistics courses taken by graduate students in the social sciences to be treated essentially as mathematics courses, with substantial emphasis on derivations and proofs. Even when empirical examples are used, which they frequently are because

[Figure: The Relationship Between 2003 Income and Age, U.S. Adults Twenty to Sixty-Four in 2004 (N = 1,57[...]).]

[...] about $50,000 per year. Without the graph, however, [...] because the coefficients [...] Equation 7.2 [...] interpretation. It is possible, [...] interpretation. It can be shown that in the equation

$\hat{Y} = m + c(F - A)^2$    (7.3)

where $m = a - b^2/4c$ and $F = -b/2c$ (with the coefficients on the right side taken from Equation 7.1), $m$ is the maximum income, and $F$ is the age at which the maximum income is attained. In the present case, the numerical estimates are

$\hat{I} = 50{,}066 - 35.95(52.53 - A)^2$;  $R^2 = .084$    (7.4)
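The reexpression of the quadratic can be checked numerically. The sketch below is only an illustration: the values of a and b are not the book's estimates (they are not legible in this copy) but are back-computed from the reported m = 50,066, F = 52.53, and c = -35.95, and the function names are hypothetical.

```python
# Illustration only: a and b are back-computed from the reported
# m = 50,066, F = 52.53, and c = -35.95; they are assumptions, not
# the published estimates of the quadratic income equation.

def vertex_form(a, b, c):
    """Return (m, F) such that a + b*A + c*A**2 == m + c*(F - A)**2."""
    F = -b / (2 * c)           # age at which the curve peaks when c < 0
    m = a - b ** 2 / (4 * c)   # height of the curve at that age
    return m, F

c = -35.95
F_reported, m_reported = 52.53, 50066.0
b = -2 * c * F_reported             # inverts F = -b / (2c)
a = m_reported + b ** 2 / (4 * c)   # inverts m = a - b^2 / (4c)

m, F = vertex_form(a, b, c)

def quadratic(age):
    return a + b * age + c * age ** 2

def reexpressed(age):
    return m + c * (F - age) ** 2

# The two forms give the same expected income at every age.
for age in (20, 40, 52.53, 64):
    print(age, round(quadratic(age), 2), round(reexpressed(age), 2))
```

Because the two forms are algebraically identical, they trace the same curve; the second simply makes the peak and the age at which it occurs readable directly from the coefficients.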

Equations 7.2 and 7.4, of course, yield the same graph because they are equivalent expressions. But Equation 7.4 also tells us precisely what the peak income is ($50,066) and that this peak is attained between fifty-two and fifty-three years of age (precisely, 52.53). An equivalent transformation is possible for equations containing additional independent variables. Consider an equation of the form

$\hat{Y} = a + b(A) + c(A^2) + d(Z)$    (7.5)

Quantitative Data Analysis: Doing Social Research to Test Ideas

$\hat{Y} = a + bT + \sum_i c_i Z_i$    (7.33)

where $T$ is a linear representation of time (here, the year of the survey), and the $Z_i$ are dummy variables for each year the survey was conducted; note that two dummy variables must be omitted because the linear term uses up one degree of freedom. We then compare the two models in the usual way, via an F-test of the significance of the increment in $R^2$ and a comparison of BIC values. A convenient way to do the first in Stata is to estimate Equation 7.33 and then to test the hypothesis that all the $Z_i$ equal zero, via a Wald test using Stata's -test- command. (Note that Equation 7.33 is simply a different parameterization of an equation in which the linear term is omitted and only the dummy variables are included. The coefficients will, of course, differ. But the predicted values, $R^2$, and BIC will be identical.)

If we conclude that no simple linear trend fits the data, we might then posit either a model with a smooth curve by including a squared term for $T$, or a model that tries to model particular historical events by grouping years into historically meaningful groups and identifying each group (less one) via a dummy variable, or a spline model (see the section "Linear Splines" later in the chapter). Because the variance explained by Equation 7.33 is the maximum possible from any representation of time (measured in years), the $R^2$ associated with Equation 7.33 can be taken as a standard against which to assess, in substantive rather than strictly statistical terms, how close various


sociologically motivated constrained models come to fully explaining temporal variation in the dependent variable.

Although, to simplify the exposition, I have not included any variables in the model other than time, a model actually posited by a researcher typically would include a number of covariates (other independent variables) and also, perhaps, interactions between the covariates and the variables representing time. Exactly the same logic would apply to such an analysis as to the simpler analysis just described; the logic is also identical to the dummy variable approach to the assessment of group differences described in the previous chapter (although here the "groups" are years or, if warranted by the analysis, multiyear historical periods).
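The two model comparisons described above (the F-test for the increment in R-squared and the comparison of BIC values) reduce to simple arithmetic once each model's R-squared is known. The sketch below is a minimal illustration under stated assumptions: the BIC formula is Raftery's approximation relative to the null model, N ln(1 - R²) + p ln(N), which is assumed here to be the version intended; the function names and all numbers are hypothetical, not the GSS results.

```python
import math

def incremental_f(r2_restricted, r2_full, extra_terms, n, k_full):
    """F statistic for the increment in R-squared when extra_terms
    predictors are added to the restricted model. k_full is the number
    of predictors in the larger model, excluding the intercept."""
    numerator = (r2_full - r2_restricted) / extra_terms
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator

def bic(r2, n, p):
    """BIC relative to the null model (assumed form): N*ln(1 - R^2) + p*ln(N).
    The more negative value indicates the more likely model."""
    return n * math.log(1.0 - r2) + p * math.log(n)

# Hypothetical example: 13 survey years, so the annual model has
# 12 predictors (equivalently, a linear term plus 11 dummies).
n = 5000
r2_linear, p_linear = 0.040, 1
r2_annual, p_annual = 0.045, 12

f_stat = incremental_f(r2_linear, r2_annual, p_annual - p_linear, n, p_annual)
print("F for increment:", round(f_stat, 2),
      "with", p_annual - p_linear, "and", n - p_annual - 1, "d.f.")
print("BIC, linear trend:", round(bic(r2_linear, n, p_linear), 1))
print("BIC, annual dummies:", round(bic(r2_annual, n, p_annual), 1))
```

In these made-up numbers the annual model explains slightly more variance, yet the linear model has the more negative BIC, because BIC charges each of the eleven extra parameters ln(N). This is the kind of disagreement between the two criteria that arises in the worked example.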

Predicting Variation in Gender Role Attitudes over Time: A Worked Example

Four items on attitudes regarding gender-role equality were asked in most years of the GSS between 1974 and 1998. The four variables are shown here with the percentage endorsing the pro-equality position, pooled over all years in which all four questions were asked:

• Do you agree or disagree with this statement? Women should take care of running their homes and leave running the country up to men (74 percent disagree).

• Do you approve or disapprove of a married woman earning money in business or industry if she has a husband capable of supporting her? (77 percent approve).

• If your party nominated a woman for President, would you vote for her if she were qualified for the job? (84 percent say yes).

• Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women (63 percent disagree).

To form a gender-equality scale, I simply summed the pro-equality responses for the four items, excluding all people to whom the questions were not asked and treating other nonresponses as negative values. The point of treating "don't know" and similar responses as negative values rather than excluding them is to save cases. But this would not be wise if there were not substantive grounds for doing so; in this case, it seemed reasonable to me to treat "don't know" as something other than a clear-cut endorsement of gender equality.

IN SOME YEARS OF THE GSS, ONLY A SUBSET OF RESPONDENTS WAS ASKED CERTAIN QUESTIONS

Users of the GSS need to be aware that to increase the number of items that can be included in the GSS each year, some items are asked only of subsets of the sample. A convenient way to exclude people who were not asked the questions is to use the Stata -rmiss- option under the -egen- command to count the number of missing data responses and then to exclude people missing data on all items included in a scale. However, in the current analysis I excluded all those who lacked responses on any of the four items because some, but not all, of the questions were asked in some years.
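The scale construction just described can be sketched in a few lines. This is a hedged illustration in Python rather than the Stata code used for the book: the response codes ("pro", "anti", "dk") and the function name are hypothetical, and `None` stands in for "question not asked."

```python
# Hypothetical response codes; the GSS variable coding is not reproduced here.
PRO = "pro"        # pro-equality answer
NOT_ASKED = None   # respondent was not asked the item

def gender_equality_scale(items):
    """Sum the pro-equality responses across the four items.

    Respondents not asked any one of the items are excluded (None is
    returned). "Don't know" and similar nonresponses count the same as
    a non-egalitarian answer, that is, they add nothing to the score.
    """
    if any(response is NOT_ASKED for response in items):
        return None
    return sum(1 for response in items if response == PRO)

respondents = [
    ["pro", "pro", "pro", "pro"],    # score 4
    ["pro", "dk", "anti", "pro"],    # score 2: "dk" is not an endorsement
    ["pro", None, "pro", "pro"],     # excluded: one item not asked
]
for answers in respondents:
    print(answers, "->", gender_equality_scale(answers))
```

The resulting scale ranges from 0 to 4, and only people who were never asked an item are dropped from the analysis.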

Multiple Regression Tricks: Techniques for Handling Special Analytic Problems

Estimating equations such as Equations 7.32 and 7.33 suggests significant nonlinearities in attitudes regarding gender inequality. The increment in $R^2$ implies F = 3.54 with 11 and 21,448 d.f., which has a probability of less than 0.0001. However, the BIC for the linear trend model is more negative than the BIC for the annual variability model (the BICs are, respectively, -959 and -871), suggesting that a linear trend is more likely given the data. Because BIC and classical inference yield contradictory results, a sensible next step is to graph annual variations in the mean level of support for gender equality, to see whether there is any obvious pattern to the nonlinearity. If substantively sensible deviations from linearity are observed, the annual variation model might be accepted, or a new model, aggregating years into historically meaningful periods, might be posited (keeping in mind the dangers of modifying your hypotheses based upon inspection of the data; see the discussion of this issue at the end of Chapter Six), or a smooth curve or spline function might be fitted to the data.

Figure 7.4 shows both the linear trend line and annual variations in the mean. Inspecting the graph, it appears that deviations from linearity are neither large nor systematic. Given this, I am inclined to accept a linear trend model as the most parsimonious representation of the data, despite the F-test results. The linear trend is, in fact, quite substantial, implying an increase of 0.81 (= .0338*(1998 - 1974)) over the quarter of a century for which we have data; this is about 20 percent of the range of the scale and is about two-thirds of the standard deviation of the scale scores. Apparently, support for gender equality has been increasing modestly but steadily throughout the closing years of the twentieth century.

From a technical point of view, it may be helpful to compare the estimates implied by the two alternative ways of representing departures from linearity: Equation 7.33 and the

FIGURE 7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464). [Plot of the linear trend and the mean for each year, by year of survey.]


alternative specification that does not include a linear term for year. When the linear term is included, two dummy variable categories are dropped, rather than one, because the linear term uses up one degree of freedom. However, the two procedures produce identical results, which is evident from inspection of Table 7.1.

Unfortunately, there is no simple correspondence between the coefficients in equations of the form of Equation 7.33 and deviations from the predictions of the linear equation. If you want to show annual departures from linearity, you need to construct a new variable, which is the difference between the predicted values for each year from Equation 7.32 and Equation 7.33. This is easily accomplished in Stata using the -foreach- or -forvalues- command.
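The construction of annual departures from linearity can be sketched as follows. This is an illustration in Python, not the Stata loop referred to in the text; the data are hypothetical. It relies on one fact: with no other covariates, a model with a full set of year dummies (with or without the linear term) predicts each year's mean, so the departure for a year is that year's mean minus the linear-trend prediction.

```python
from collections import defaultdict

# Hypothetical (survey year, scale score) pairs standing in for respondents.
data = [(1974, 2.0), (1974, 3.0), (1975, 3.0), (1975, 2.0), (1975, 4.0),
        (1977, 2.0), (1977, 3.0), (1978, 4.0), (1978, 3.0), (1978, 3.0)]

n = len(data)
mean_t = sum(t for t, _ in data) / n
mean_y = sum(y for _, y in data) / n

# Linear trend fitted by ordinary least squares (closed form, one predictor).
slope = (sum((t - mean_t) * (y - mean_y) for t, y in data)
         / sum((t - mean_t) ** 2 for t, _ in data))
intercept = mean_y - slope * mean_t

# Predicted values from the annual-dummy model are the annual means.
scores_by_year = defaultdict(list)
for t, y in data:
    scores_by_year[t].append(y)

departures = {}
for t in sorted(scores_by_year):
    annual_mean = sum(scores_by_year[t]) / len(scores_by_year[t])
    departures[t] = annual_mean - (intercept + slope * t)
    print(t, round(departures[t], 3))
```

Weighted by the number of respondents in each year, the departures sum to zero, because the least-squares line passes through the overall mean.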


LINEAR SPLINES

Sometimes we encounter situations in which we believe that the relationship between two variables changes abruptly at some point on the distribution of the independent variable, so that neither a linear nor a curvilinear representation of the relationship is adequate. For example, alcohol consumption may have no impact on health below some threshold, whereas above the threshold health declines in a linear way as alcohol consumption increases. Temporal trends also may abruptly change, as a result of policy changes, cataclysmic events such as depressions, wars, revolutions, and so on. In cases of this kind, it is useful to represent the relationships via a set of connected line segments, known as linear splines.

A Worked Example: Trends in Educational Attainment over Time in the United States

Consider changes in the average level of education over time. Figure 7.5 presents a scatter plot relating educational attainment to year of birth, estimated from the GSS. To create this graph, I combined data from all years between 1972 and 2004, dropped those born prior to 1900 because the very small sample sizes [...], and dropped those less than age twenty-five at the time of the survey because many people do not complete their schooling until their mid-twenties. The plot is for a 5 percent sample of cases, to make it readable, and is "jittered" to make [...]. [...] there appears to be [...] the trend is difficult to discern: is it linear or is the trend better represented by some other functional form?

To discover how the increase [...] shows such a plot, made with the same specifications as the scatter plot. Inspecting the graph, the average education increased in a more or less linear way for those born between 1900 and 1947 but then leveled off. Because [...] a bit, probably [...] relatively small number [...], it might be better to [...] three-year moving average plot [...] -do- and -log- files.) Inspecting this graph, we reach the same conclusion: there is a fairly abrupt change in the trend, with those born [...] of the twentieth

[TABLE 7.1 (title not legible). Coefficients and predicted values by year under the two specifications. Legible entries for the specification with a linear term: for 1975, a + b(1975) + c_1975 = -71.68578 + 0.0375872*1975 + 0.0403799 = 2.5893; for 1977, the predicted value is 2.5105. Rows for 1975, 1977, and 1998 are visible; the remaining entries are not legible.]


FIGURE 7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample). [Scatter plot; horizontal axis: year of birth, 1900 to 1970.]

century (precisely, until 1947) experiencing a fairly steady year-by-year increase in their schooling, but those born in 1947 or later experiencing no change at all. This suggests that the trend in educational attainment is appropriately represented by a linear spline with a knot at 1947, where "knot" refers to the point at which the slope changes. This specification can be represented by an equation of the form:

$\hat{E} = a + b_1(B_1) + b_2(B_2)$    (7.34)

where $B_1$ = the year of birth for those born in 1947 or earlier and = 1947 otherwise, and $B_2$ = the year of birth - 1947 for those born after 1947 and = 0 otherwise. More generally, a spline function relating $Y$ to $X$ with segments $\nu_1, \ldots, \nu_{k+1}$ and knots at $k_1, k_2, \ldots, k_k$ can be represented by

$\hat{Y} = a + b_1(\nu_1) + b_2(\nu_2) + \cdots + b_{k+1}(\nu_{k+1})$    (7.35)

where $\nu_1 = \min(X, k_1)$, $\nu_2 = \max(\min(X - k_1, k_2 - k_1), 0)$, ..., $\nu_{k+1} = \max(X - k_k, 0)$ (see Panis [1994]; the entry for Stata's -mkspline- command [StataCorp 2007]; and Greene [2008]). Each slope coefficient is then the slope of the specified line segment. We can see this concretely by going back to our example, Equation 7.34, and evaluating the equation separately for those born in 1947 or earlier and those born after 1947. For those born in 1947 or earlier, we have

FIGURE 7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). [Horizontal axis: year of birth.]

FIGURE 7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). [Horizontal axis: year of birth.]


$\hat{E} = a + b_1(B) + b_2(0) = a + b_1(B)$    (7.36)

and for those born later than 1947, we have

$\hat{E} = a + b_1(1947) + b_2(B - 1947) = (a + 1947b_1) + b_2(B - 1947)$    (7.37)
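The construction of the spline variables can be sketched in a few lines. This is an illustration in Python, not Stata's -mkspline-; the function names are hypothetical, and the intercept used in the example is an assumption chosen only so that predicted schooling is near thirteen years at the knot. The two slopes, .086 and .0092, are the estimates discussed in the text.

```python
def linear_spline_basis(x, knots):
    """Return [v_1, ..., v_(k+1)] for a value x and an ascending list of
    k knots, so that each coefficient is the slope of its own segment."""
    basis = [min(x, knots[0])]
    for lower, upper in zip(knots, knots[1:]):
        basis.append(max(min(x - lower, upper - lower), 0))
    basis.append(max(x - knots[-1], 0))
    return basis

def predicted(x, intercept, slopes, knots):
    """Evaluate a + b_1*v_1 + ... + b_(k+1)*v_(k+1)."""
    return intercept + sum(b * v for b, v in
                           zip(slopes, linear_spline_basis(x, knots)))

knots = [1947]
slopes = [0.086, 0.0092]   # slopes before and after the knot, from the text
intercept = -154.4         # hypothetical value, not the book's estimate

for birth_year in (1900, 1935, 1947, 1948, 1949, 1979):
    print(birth_year, linear_spline_basis(birth_year, knots),
          round(predicted(birth_year, intercept, slopes, knots), 2))
```

For a single knot at 1947 the two constructed variables are min(B, 1947) and max(B - 1947, 0), which are the definitions of B_1 and B_2 in Equation 7.34, and the prediction for 1949 exceeds the prediction for 1947 by exactly twice the second slope.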

Notice that the intercept in Equation 7.37 is just the expected level of education for those born in 1947 and that $b_2$ gives the slope for those born after 1947. Thus, the expected level of education for those born in 1948 is just the expected level of education for those born in 1947 plus $b_2$; for those born in 1949 it is the expected level of education for those born in 1947 plus $2b_2$; and so on.

Estimating Equation 7.34 from the pooled 1972-2004 GSS data yields the coefficients in Table 7.2. By inspecting the BICs for three models (the spline model, a linear trend model, and a model that allows the expected level of schooling to vary year by year) it is evident that the linear spline model is to be preferred. Note, however, that a comparison of $R^2$s indicates that by the criterion of classical inference, the model positing year-by-year variation in the level of schooling fits significantly better than the spline model. I am inclined to discount this result because it has no theoretical justification, is

AN ALTERNATIVE SPECIFICATION OF SPLINE FUNCTIONS

An alternative specification represents the slope of each line segment as a deviation from the slope of the previous line segment. In this specification, a different set of new variables is constructed. Suppose there are $k$ knots, that $X$ is the original variable, and that $\nu_1, \ldots, \nu_{k+1}$ are the constructed variables. Then

$\nu_1 = X$
$\nu_2 = X - k_1$ if $X > k_1$; = 0 otherwise
...
$\nu_{k+1} = X - k_k$ if $X > k_k$; = 0 otherwise

To see this concretely, consider the present example specifying a knot at birth year 1947 in the trend in educational attainment. We would estimate the equation where $\nu_1$ = birth year ($X$), and $\nu_2 = X - 1947$ if $X > 1947$ and = 0 otherwise. Then for those born in 1947 or earlier,

$\hat{E} = a + b_1(X) + b_2(0) = a + b_1(X)$

while for those born later than 1947

$\hat{E} = a + b_1(X) + b_2(X - 1947)$

Thus, for those born in 1948, the expected level of education is given by $(a + 1948b_1) + b_2$; for those born in 1949 it is $(a + 1949b_1) + 2b_2$; and so on. From this, it is evident that $b_2$ gives the deviation of the slope from the slope for the previous line segment. For useful discussions of these methods, see Smith (1979) and Gould (1993).

TABLE 7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).

[Legible entries. Slope (birth years 1947-1979): coefficient .0092, s.e. .0024. Model comparisons: (2) linear trend model, .1167; (1) vs. (2): -531; .0121; 545.2; 1; 39321; .0000. The remaining entries are not legible.]

clearly inferior by the BIC criterion, and occurs simply as a consequence of the large sample size. Thus, I accept the linear spline model as the preferred model.

The coefficients for the line segments indicate that for people born in 1947 or earlier, there is an expected increase of .086 years of schooling for each successive birth cohort. Thus, people born twelve years apart would be expected to differ on average by about a year of schooling. However, for people born in 1947 or later, there is no trend in educational attainment; the coefficient .0092 implies that it would take about a century for average schooling to increase by one year. This is a somewhat surprising result, especially because there have been su[...] disadvantaged minorities, that is, [...] and also among women. However, as Mare (1995, [...]) notes, educationally disadvantaged proportions of the population have grown over time relative to the White majority. Disaggregation of the trend would be worthwhile but is not pursued here; it would make an interesting paper.

The graph implied by the coefficients for the linear spline model is shown in Figure 7.8, together with a 2 percent random sample of observations from each cohort (reduced from 5 percent to 2 percent to make it easier to see the shape of the spline). In this figure the -jitter- feature in Stata is used to make it clear where in the graph there is the greatest density of points.

FIGURE 7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5; Scatter Plot Shown for 2 Percent Sample). Predicted Values from a Linear Spline with a Knot at 1947. [Horizontal axis: year of birth, 1900 to 1980.]

A Second Worked Example, with a Discontinuity: Quality of Education in China Before, During, and After the Cultural Revolution

The typical use of spline functions is to estimate equations such as the one just discussed, in which all points are connected but the slope changes at specified points ("knots"). However, there are occasions in which we may want to posit discontinuous functions. The Chinese Cultural Revolution is such a case. It can be argued that the disruption of social order at the beginning of the Cultural Revolution in 1966 was so massive that it is inappropriate to assume any continuity in trends. Deng and Treiman (1997) make just such an argument with respect to trends in educational reproduction. They argue that there was then a gradual "return to normalcy" so that changes resulting from the end of the Cultural Revolution in 1977 were not nearly as sharp and were appropriately represented by a knot in a spline function rather than a break in the trend line.

Here we consider another consequence of the Cultural Revolution, the quality of education received (the example is adapted from Treiman [2007a]). Although primary schools remained open throughout the Cultural Revolution, higher level schools were shut down for varying periods: most secondary schools were closed for two years, from 1966 to 1968, and most universities and other tertiary-level institutions were closed for six years, from 1966 to 1972. Moreover, it was widely reported that even when the


schools were open, little conventional instruction was offered: rather, school hours were taken up with political meetings and political indoctrination. Rigorous academic instruction was not fully reinstituted until 1977, after the death of Mao. Under the circumstances, we might well suspect that, quite apart from deficits in the amount of schooling acquired by those who were unfortunate enough to be of school age during the Cultural Revolution period, those cohorts also experienced deficits in the quality of schooling compared to those who obtained an equal amount of schooling before or after the Cultural Revolution.

To test this hypothesis, we can exploit the ten-item character recognition test administered to a national sample of Chinese adults that was also analyzed in Chapter Six (see Table 6.2). As before, I take the number of characters correctly identified as a measure of literacy and hypothesize that, net of years of school completed, people who turned age eleven during the Cultural Revolution would be able to recognize fewer characters than people who turned eleven before or after the Cultural Revolution period. Moreover, following Deng and Treiman (1997), I posit a discontinuity in the scores at the beginning but not at the end of the period. To do this, I estimate an equation of the form:

$\hat{Y} = a + b_1(B_1) + b_2(B_2) + b_3(B_3) + c_1(D_1) + d(E)$    (7.38)

where $B_1$ = year of birth (last two digits) if born prior to or in 1955 and = 55 if born subsequent to 1955; $B_2$ = 0 if born prior to 1956, = year of birth - 55 if born between 1956 and 1967, inclusive, and = 67 - 55 if born subsequent to 1967; $B_3$ = 0 if born prior to or in 1967 and = year of birth - 67 for those born after 1967; $D_1$ = 0 for those born prior to or in 1955 and = 1 for those born after 1955; and $E$ = years of school completed. Note that the difference between this and Equation 7.35 is that I include a dummy variable to distinguish those born after 1955 from those born earlier; this is what permits the line segments to be discontinuous at 1955. If I were to have posited a discontinuity at 1967 as well, the equation would be the mathematical equivalent to estimating three separate equations, for the period before, during, and after the Cultural Revolution, in each case predicting the number of characters recognized from years of schooling and year of birth. The advantage of equations such as Equation 7.38 is that they permit the specification of alternative models within a coherent framework and by so doing permit us to select between models.

Estimating this equation yields the results shown for Model 4 in Tables 7.3 and 7.4. As in the previous example, I contrast my theory-driven specification with other possibilities: that there is a simple linear trend in the data; that there are year-by-year variations; that there are knots at both the beginning and the end of the Cultural Revolution, but no discontinuities; that there are discontinuities at both the beginning and the end of the Cultural Revolution; and, for the three spline functions, that there is a curvilinear relationship between year and knowledge of characters during the Cultural Revolution period.
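The variables that define the discontinuous spline can be constructed directly from the two-digit birth year. The sketch below is an illustration in Python with a hypothetical function name; it follows the definitions given above (knots at 55 and 67, with a dummy that allows a break at 55).

```python
def cultural_revolution_terms(birth_year):
    """Return (B1, B2, B3, D1) for a two-digit birth year.

    B1, B2, B3 are the spline segments with knots at 55 and 67;
    D1 is the dummy that permits a discontinuity after 1955.
    """
    b1 = min(birth_year, 55)
    b2 = max(min(birth_year - 55, 67 - 55), 0)
    b3 = max(birth_year - 67, 0)
    d1 = 1 if birth_year > 55 else 0
    return b1, b2, b3, d1

for year in (40, 55, 56, 60, 67, 68, 75):
    print(year, cultural_revolution_terms(year))
```

Only the dummy changes abruptly between 55 and 56 (the segment variable for the middle period moves from 0 to 1, as it would for any one-year step), so its coefficient measures the size of the break at the start of the Cultural Revolution cohorts. Because no dummy is added at 67, the line is continuous there and only the slope may change.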

TABLE 7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).

[Legible BIC values: -6725.9, -6723.9, -6722.1, -6724.1, -6717.4. The assignment of these and the other entries to models is not recoverable.]

TABLE 7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).

                                                  Model 4        Model 5        Model 7
Years of schooling                                .443 (.000)    .443 (.000)    .444 (.000)
Trend, 1955 or earlier (age 11 1965 or earlier)   0.001 (.721)   0.001 ([...])  0.001 (.749)

[Further rows: Trend, 1956-1967 (age 11 1966-1977); Trend, 1968 or later (age 11 1978 or later); Discontinuity at 1955; Discontinuity at 1967; Curvilinear trend, 1956-67; s.e.e. (root mean square error). Their entries cannot be reliably assigned to columns.]

A comparison of the BICs suggests that three models (my hypothesized model; a model that, in addition to a discontinuity at the beginning of the Cultural Revolution, allows the trend during the Cultural Revolution period to be curvilinear; and a model positing discontinuities at both the beginning and the end of the Cultural Revolution) are about equally likely given the data, albeit with weak evidence favoring the single-knot model. Note that all three are strongly to be preferred over all other models.


Again, BIC and classical inference yield contradictory results because the two alternative models fit significantly better (at the 0.01 level) than does the originally hypothesized model. Here I am in a bit of a quandary as to which model to prefer. I have already stated a basis for positing a single discontinuity, plus a knot at the end of the Cultural Revolution. However, another analyst might favor a two-discontinuity model, on the ground that the curricular reform in 1977 that restored the primacy of academic subjects was radical enough to posit a discontinuity at the end as well as at the beginning of the Cultural Revolution. A third analyst might argue that a linear specification of trends, especially in times of great social disruption, is too restrictive and that it makes more sense to posit a curvilinear effect of time during the Cultural Revolution period. In Treiman (2007a), I presented the model positing a discontinuity at 1955, a knot at 1967, and a curve between 1955 and 1967 (see Figure 7.4 in that paper). However, the truth is that there is no clear basis for preferring any one of the three, except for the evidence provided by BIC, which suggests that the originally hypothesized model is slightly more likely than the others given the data. Again, my suggestion is, go with theory. If you have a theoretical basis for one specification over the others, that is the one to feature; but, at the same time, you must be honest about the fact that alternative specifications fit nearly equally well. In fact, the optimal approach is to present all three models and invite the reader to choose among them. A warning: if you do this, you probably will have to fight with journal editors, who are always trying to get authors to reduce the length of papers, and perhaps with reviewers, who sometimes seem to want definitive conclusions even when the evidence is ambiguous.

The estimated coefficients for all three models are shown in Table 7.4. In all three models each additional year of schooling results in nearly half a point improvement in the number of characters identified. However, the coefficients associated with trends over time are relatively difficult to interpret. Again, this is an instance in which graphing the relationship helps. Figure 7.9 shows, for each of the three preferred models, the predicted number of characters recognized for people with twelve years of schooling, that is, who have completed high school. Although the three graphs appear to be quite different, they all show a decline of about half a point in the number of characters identified for those who were age eleven during the early years of the Cultural Revolution period, relative to those with the same level of schooling who turned eleven before and after the Cultural Revolution. Thus, despite the difficulty in choosing among alternative specifications, together they strongly suggest that the quality of education declined during the Cultural Revolution. People who acquired their middle school (junior high school) education during the Cultural Revolution, in effect, lost a year of schooling; that is, they displayed knowledge of vocabulary equivalent to those with one year less schooling who were educated before and after the Cultural Revolution.

Still, we should be cautious in our interpretation of Figure 7.9, where the Cultural Revolution effect appears to be quite large because of the way the data are graphed (with the y-axis ranging from 5.3 to 6.7 characters recognized). Indeed, Figure 7.10, in which the y-axis ranges from 0 to 10, suggests a rather different story: a very modest decline in the number of characters recognized. It is quite reasonable to report figures such as Figure 7.9 to make the differences among the model ...

Imperial degree holder (xiucai, juren)    30.5
Other                                     39.0
Total                                     28.5
                                          2,413

reliability, analyses including such variables often can be misleading. A way of correcting this problem, when measures of reliability are available, is to correct correlations for attenuation caused by unreliability. The Stata command -eivreg- (errors-in-variables regression) does this conveniently. The analyst supplies an estimate of the reliability of each variable, and the command makes the adjustment and carries out the regression estimation. (If no estimate is supplied, the variable is assumed to be measured with perfect reliability.)
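The idea of correcting a correlation for attenuation can be illustrated with a minimal Python sketch. It applies only the classical bivariate formula, in which the observed correlation is divided by the square root of the product of the two reliabilities. It is not the -eivreg- algorithm, which adjusts a full multiple regression. The function name and the input values are hypothetical and chosen purely for illustration.

```python
import math


def disattenuate(r_xy, rel_x, rel_y=1.0):
    """Classical correction for attenuation: r_xy / sqrt(rel_x * rel_y).

    rel_x and rel_y are reliability coefficients in (0, 1]; a value of 1.0
    means the variable is treated as measured without error.
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical observed correlation and reliabilities (not taken from the text).
observed = 0.30
corrected = disattenuate(observed, rel_x=0.66, rel_y=0.80)
print(f"observed r = {observed:.3f}, corrected r = {corrected:.3f}")
```

Because reliabilities are at most 1, the corrected correlation is never smaller in absolute value than the observed one, and the less reliable the measures, the larger the adjustment.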


To show how this procedure works and what consequences it can have, I here present an analysis of the effect of abortion attitudes and religiosity (the three scales created previously) plus race, region of residence, an interaction between race and region, and the natural log of income on political conservatism.

From the previous analysis we have the reliability of the three scales. I take the reliability of the income measure, .8, from Jencks and others (1979, Table A2.13) and assume that race and region of residence are measured without error. Table 11.7 shows the results of OLS estimation without correcting for unreliability of measurement, and errors-in-variables estimation that does correct for unreliability. Because -eivreg- does not permit correction of the standard errors for clustering and requires aweights (or fweights), I carried out both conventional and errors-in-variables regression with the aweights specification.

TABLE 11.7. Coefficients of a Model of the Determinants of Political Conservatism Estimated by Conventional OLS and Errors-in-Variables Regression, U.S. Adults, 1984 (N = 1,294).

                                  Conventional OLS           Errors-in-Variables
                                  b        s.e.     p        b        s.e.     p
Religiosity                       0.692    0.170    .000     1.066    0.257    .000
Personal preference abortion     -0.282    0.091    .002    -0.220    0.113    .051
R²                                0.063                      0.079
s.e.e.                            1.24                       1.23

[The remaining rows are not legible in the source; the legible fragments are .816, 0.172, and 0.425.]

Scale Construction

The effect of adjusting for differential reliability is dramatic: the coefficient associated with religiosity increases by 54 percent. In addition, the coefficients associated with acceptance of therapeutic abortion and with income increase slightly, and the coefficient associated with acceptance of abortion for personal preference reasons decreases slightly. These results indicate clearly how the relative effects of variables in a multiple regression are distorted if the variables are measured with differential reliability, as these are. (Recall that the reliabilities of the religiosity, therapeutic abortion, income, and personal preference abortion measures are, respectively, .66, .78, .80, and .93.)

Note that with one exception, all the coefficients are about what we would expect: political conservatism increases with religiosity, with income, and for non-Blacks, with Southern residence (although this last effect is only marginally significant) and decreases as acceptance of both kinds of abortion increases. The unexpected result is that acceptance of therapeutic abortion is a much stronger predictor of political conservatism than is acceptance of abortion for personal preference reasons. From the analysis I presented ...

... bney 1999, 7-8); and as noted, BIC is not available. The optimal solution may be to treat clustered samples in a multilevel context, estimating either fixed- or random-effects models (Mason 2001), which can be done in Stata using the -xt- or -gee- command; these procedures go beyond what can be covered in this book, but see Chapter Sixteen for a brief introduction to multilevel analysis. Although even now much that is published, even in leading journals, simply ignores complex sample designs and treats data as if they were generated by random sampling procedures, this is generally inappropriate and can lead to incorrect inferences.

For the present, I suggest for logistic regression in its various forms that when you have data that are weighted or clustered you carry out your estimation using Stata's survey estimation commands and rely on adjusted Wald tests for model selection. Be cautious, however, in your interpretation and explore alternative specifications. Only where you have a true, unweighted, random sample should you use the -logistic- command and likelihood ratio test (-lrtest-). Further, whenever possible, eschew weighting in favor of including the variables used to create the weights in the model.

... techniques for making the general shape of a distribution clear by removing "noise," that is, deviations from the underlying trend that result from sampling error or idiosyncratic factors. Perhaps the simplest smoother is a moving average. A moving average is the average value of several consecutive data points. Consider the worked example in this section. A three-year moving average of the expected probability of marriage at each age would be constructed by first taking the average of the expected probabilities for ages fifteen, sixteen, and seventeen; then the average of the expected probabilities for ages sixteen, seventeen, and eighteen; and so on. At the time the age-at-first-marriage example was created, the Stata subcommand -ma- ("moving average") was available within the -egen- command. However, this subcommand is no longer documented (although in Stata 10 it still works), and has been replaced by -smooth-, which generates medians of the included points rather than means. Another smoother available in Stata is -lowess-.
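The construction just described can be sketched in a few lines of Python. This is an illustrative stand-in rather than the Stata -ma- subcommand; the function name and the probabilities are hypothetical. Each output value is the mean of three consecutive inputs, so a series of n points yields n - 2 smoothed points.

```python
def moving_average(values, window=3):
    """Return the means of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical expected probabilities of first marriage at ages 15 to 19.
probabilities = [0.02, 0.05, 0.03, 0.08, 0.06]

# First value averages ages 15-17, second averages ages 16-18, and so on.
smoothed = moving_average(probabilities, window=3)
for start_age, value in zip(range(15, 18), smoothed):
    print(f"ages {start_age}-{start_age + 2}: {value:.4f}")
```

A wider window removes more of the year-to-year bounce but also flattens genuine peaks, so the choice of window is a judgment call.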


... non-Blacks (precisely, $0.591 = 0.190 \times 3.108$). Among 30-year-old never-married people, the odds of marrying in that year among those whose mothers are college graduates are only 10 percent higher than the odds for those of the same race and sex whose mothers are high school graduates (precisely, $1.094 = (0.918 \times 1.114)^4$).

Despite the usefulness of Table 13.6 for making specific contrasts, the overall pattern implied by the coefficients is difficult to discern. Again, graphs help. Figures 13.4 and 13.5 show three-year moving averages of the expected probability of first marriage by age at risk, separately for Blacks and non-Blacks. In each graph, separate lines are shown for males and females whose mothers had twelve and sixteen years of schooling (as a convenient way of visually representing the effect of mother's education). Moving averages are shown because there is a great deal of "float and bounce" for individual years, which is evident from inspection of the coefficients in Table 13.6. (See the downloadable file "...13_2.do" for details on how the moving averages were constructed.)

Inspecting Figures 13.4 and 13.5, we see that marriage rates for Blacks differ substantially from those for non-Blacks, with Blacks much less likely than non-Blacks to marry at all. Moreover, non-Black females (especially those whose mothers have only a high school education) marry at disproportionately high rates at ages nineteen through twenty-five; non-Black males marry a bit later and with less concentration in a short period. Black marriage rates, by contrast, are spread out over a much longer period, but there is an upsurge in marriage rates for males in their thirties, especially those whose mothers are high school educated. For both Blacks and non-Blacks, males tend to marry later than females, with male rates exceeding those of females beginning around age thirty. Finally, among all race-by-sex groups, those whose mothers are high school graduates are more likely to marry than are those whose mothers are college graduates.

If I were preparing these results for publication, I would present only a subset of the rather large set of tables and graphs we have just marched through. The intent here, of

FIGURE 13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
[Line graph; x-axis: age at risk; lines: Females (12), Males (12), Females (16), Males (16).]

FIGURE 13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
[Line graph; x-axis: age at risk; lines: Females (12), Males (12), Females (16), Males (16).]

Binomial Logistic Regression

course, is to provide alternatives for you to consider when presenting your own analyses. Examples of the application of discrete-time hazard-rate models include Astone and others (2000), Dawson (2000), Lewis and Oppenheimer (2000), and Sweeney (2002).

A FOURTH WORKED EXAMPLE (CASE-CONTROL MODELS): WHO WAS APPOINTED TO A NOMENKLATURA POSITION IN RUSSIA?

When a dependent variable is a rare event, it is inefficient to draw a representative sample of the population at risk for the event, because the sample size would have to be extremely large to obtain enough "positive" cases to analyze. This is a frequent occurrence in epidemiological research, where the events of interest are diseases, but it also occurs in the social sciences. For example, if we are interested in studying what determines who gets elected to Congress, we could hardly do this by drawing a representative sample of the population and looking for the congressmen in it. We have similar problems in studying crime victimization, homosexuality, and various other relatively uncommon phenomena. One solution to this problem is to sample on the dependent variable (that is, to draw a sample of congressmen, criminals, or homosexuals), collect information on that sample, collect corresponding information on a representative sample of the population that has not experienced the rare event (becoming congressmen, criminals, or homosexuals), combine the two samples, and model the odds of experiencing the rare event. This is known as case-control sampling in the epidemiological literature (for an excellent review of the statistical procedures involved, see Breslow [1996]).

Case-control sampling exploits the fact that odds ratios are invariant under shifts in the distribution of the data. This extremely important feature of odds ratios makes it possible to combine samples with very different distributions on the independent and dependent variable in order to model rare events. This capability is not possible with OLS regression because OLS coefficients are affected by the distributions of the variables in the model.

To see how case-control procedures work in practice, let us consider what factors affected the odds of becoming a member of the Russian political elite at the end of the Communist era. From Social Stratification in Eastern Europe after 1989 (Treiman and Szelényi 1993), we have two representative samples from Russia: a probability sample of the general population (N = 5,002) and a random sample of persons who were in nomenklatura positions as of January 1988 (N = 850). (See Appendix A for a description of the data and information on how to obtain them.) Nomenklatura positions were those that required the approval of the Central Committee of the Communist Party. They ranged from very high government officials (for example, members of the Politburo) down to heads of sensitive organizations (for example, rectors of universities, editors in chief of newspapers, and heads of large industrial enterprises).

The general population sample departs in two ways from compliance with the assumptions underlying case-control sampling, but neither deviation is important from a practical standpoint. First, it is a probability sample of the 1993 population rather than the 1988 population. However, the sampling frame is based on the 1989 census, and the sample therefore probably represents the 1988 population nearly as well as it does ...
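The invariance of odds ratios under sampling on the dependent variable, described above, can be checked with a small Python sketch. The counts are hypothetical and are not drawn from the Russian data. The case-control counts are the expected counts when every case is kept and one in ten non-cases is sampled, so sampling error is ignored. Exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction


def odds_ratio(exposed_cases, exposed_noncases, unexposed_cases, unexposed_noncases):
    """Odds of being a case among the exposed divided by the odds among the unexposed."""
    odds_exposed = Fraction(exposed_cases, exposed_noncases)
    odds_unexposed = Fraction(unexposed_cases, unexposed_noncases)
    return odds_exposed / odds_unexposed


# Hypothetical population: 50 cases among 20,000 people.
population_or = odds_ratio(40, 9960, 10, 9990)

# Case-control design: keep all 50 cases, keep one in ten non-cases
# (expected counts: 996 exposed and 999 unexposed non-cases).
case_control_or = odds_ratio(40, 996, 10, 999)

# The share of "positive" observations changes a great deal ...
population_share = Fraction(50, 20000)
case_control_share = Fraction(50, 50 + 996 + 999)

# ... but the odds ratio does not.
print(float(population_or), float(case_control_or))
print(float(population_share), float(case_control_share))
```

Because the sampling fraction for non-cases appears in both the numerator and the denominator of the odds ratio, it cancels, which is why the two odds ratios agree exactly while the proportion positive does not.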


Before turning to interpretation of the results, we should note the one difference between case-control analysis and ordinary binomial logistic regression: in case-control analysis the intercept is not meaningful. This should be obvious from the fact that the intercept in logistic regression indicates the proportion of the sample that is "positive" with respect to the dependent variable. However, in case-control designs this proportion is fixed by the sample design, and thus the coefficient adds no information.

Inspecting the coefficients in Table 13.7, we see very large effects and few surprises. Each year of schooling increases the odds of becoming a member of the nomenklatura by more than 70 percent. Thus, all else equal, university graduates (who typically have 15 years of schooling in Russia) are more than 15 times as likely as high school graduates (with 10 years of schooling) to be appointed to nomenklatura positions (precisely, $15.32 = 1.726^{(15-10)}$). The effect of gender is astronomical: males are more than 17 times as likely as females to be appointed to nomenklatura posts. The effect of age is also extremely strong: all else equal, the odds of being appointed to a nomenklatura position increase about 14 percent per year. Thus, for example, a 50-year-old is more than 7 times as likely to secure a nomenklatura position as is a 35-year-old (precisely, $7.23 = 1.141^{(50-35)}$). Perhaps more interesting, the effect of social origins, even among those equally well educated, is far from trivial.

Coming from a family in which one's father was a member of the Communist Party improves one's chances of a nomenklatura appointment by about half, all else equal. Also, each year of father's schooling increases the odds of nomenklatura appointment by about 11 percent (this in the worker's paradise!), so that the offspring of the university-educated intelligentsia (15 years of school) are about three times as likely as the offspring of those with only a primary education to secure nomenklatura appointments, irrespective of their own educational achievement (precisely, $2.94 = 1.114^{(15-5)}$). Alone among the variables we have considered, father's occupational status has no impact on the odds of appointment to a nomenklatura post.
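The "precisely" figures in the preceding paragraphs all come from raising a per-unit odds multiplier to the power of the difference in units. A minimal Python sketch of that arithmetic follows; the multipliers 1.726, 1.141, and 1.114 are the ones quoted in the text, whereas the function name is hypothetical.

```python
import math


def compounded_odds_multiplier(per_unit_multiplier, units):
    """Odds multiplier for a difference of `units` on the predictor."""
    return per_unit_multiplier ** units


# 15 versus 10 years of schooling.
schooling = compounded_odds_multiplier(1.726, 15 - 10)
# Age 50 versus age 35.
age = compounded_odds_multiplier(1.141, 50 - 35)
# Father with 15 versus 5 years of schooling.
father_schooling = compounded_odds_multiplier(1.114, 15 - 5)

print(round(schooling, 2), round(age, 2), round(father_schooling, 2))

# Equivalent route through the logit coefficient b = ln(multiplier):
# exp(b * units) gives the same result as multiplier ** units.
b = math.log(1.726)
assert math.isclose(math.exp(b * 5), schooling)
```

The second route shows why effects that are additive on the log-odds scale become multiplicative on the odds scale.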

WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate and interpret binary logistic regression models, which are widely used to model dichotomous outcomes such as whether people vote, are employed, or are members of a particular organization. We have seen that although the estimation procedures are quite different, the interpretation of the coefficients of such models is similar to that of OLS regression, except that the coefficients represent net effects of each independent variable on the log odds of an outcome.

Because log odds are not intuitive quantities, we have considered two nonlinear transformations to more readily interpretable coefficients (odds and expected probabilities) and have also seen how to graph net relationships, a form of regression standardization for logistic regression. Finally, we considered three extensions of the basic logistic regression model: education progression ratios, discrete-time hazard-rate models, and case-control models. A notable feature of logistic regression models is that they are invariant with respect to the distributions of variables in the sample, which is what makes case-control procedures legitimate in the logistic regression context but not in the OLS regression context.


APPENDIX 13.A SOME ALGEBRA FOR LOGS AND EXPONENTS

For those who have forgotten their school algebra, here are some useful relations involving natural logarithms and antilogs (exponents):

$$e^{\ln(X)} = X$$

$$\ln(X \cdot Y) = \ln(X) + \ln(Y)$$

$$\ln(X / Y) = \ln(X) - \ln(Y)$$

$$X \cdot Y = e^{\ln(X)} e^{\ln(Y)} = e^{(\ln(X) + \ln(Y))}$$
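Readers who prefer to see these relations confirmed numerically can try a short Python sketch such as the following. The function name and the test values are hypothetical, and the comparisons use a floating-point tolerance rather than exact equality.

```python
import math


def check_log_identities(x, y):
    """Check the log and exponent relations numerically for positive x and y."""
    return {
        "exp(ln x) = x": math.isclose(math.exp(math.log(x)), x),
        "ln(x*y) = ln x + ln y": math.isclose(
            math.log(x * y), math.log(x) + math.log(y)
        ),
        "ln(x/y) = ln x - ln y": math.isclose(
            math.log(x / y), math.log(x) - math.log(y)
        ),
        "x*y = exp(ln x + ln y)": math.isclose(
            x * y, math.exp(math.log(x) + math.log(y))
        ),
    }


# Arbitrary positive values chosen for illustration.
for identity, holds in check_log_identities(3.5, 0.8).items():
    print(f"{identity}: {holds}")
```

These are the relations that let a logit coefficient be converted to an odds multiplier and back.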

TABLE 14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p-Values in Italic.)

Variable                      Russian                 English                 Both

Logit coefficients (b)
Professional in 1988          0.9943 (.1548) .000     1.124 (.2990) .000      1.3856 (.3577) .000
[row label illegible]         -5.5378 (.3021) .000    -8.1541 (.5036) .000    -10.1965 (.5866) .000

Odds multipliers (e^b)
Years of school completed     1.248                   1.363                   1.508
Communist Party member?       1.353                   0.408                   1.050
Other manager in 1988         2.702                   2.38

[The remaining rows are not legible in the source. One further coefficient, -28.3602 (.7039), p = .000, is legible, but its row and column cannot be determined.]

... about a third higher than the odds that they spoke neither language, whereas the odds that Communist Party members spoke English but not Russian are only about 40 percent as great as the odds that they spoke neither language. Thus the odds that Communist Party members spoke Russian but not English are more than three times as great as the odds that they spoke English but not Russian (because $1.353/0.408 = 3.316$). The same is true of service as a government or Communist Party official. Here, as expected, officials were nearly five times as likely to speak Russian (as opposed to speaking neither Russian nor English) than were those who were neither managers nor professionals and technicians (recall that the reference category is all other occupations). The odds that government officials spoke English or both Russian and English are effectively zero, which they should be because not one of the sixteen officials in the sample spoke English.

Finally, we see that being a professional or technician in 1988 roughly triples the odds of speaking Russian only or English only, and quadruples the odds of speaking both English and Russian, relative to speaking neither English nor Russian. By contrast, being a manager in 1988 triples the odds of speaking Russian only, relative to speaking neither. But the net effects of being a manager on the odds of speaking English and of speaking both English and Russian were both somewhat smaller than the effect of being a manager on the odds of speaking Russian. Also, the coefficients are only marginally significant, at about the 0.1 level.

Although for this example I settled on a single model in advance, model selection for multinomial logit models is carried out in exactly the same way as for binomial logit models: by taking the ratio of the difference in $L^2$s (Model $\chi^2$) to the difference in degrees of freedom for any two models, to determine whether one model fits the data significantly better than the other model (but recall that this procedure is not appropriate when robust estimation is used, that is, when the data are weighted or clustered; rather, a Wald test should be used to compare models).

Multinomial and Ordinal Logistic Regression and Tobit Regression

Independence of Irrelevant Alternatives

In the multinomial logit model, the relative odds of being in two categories are assumed to be independent of the other alternatives included in the model. This follows from Equation 14.1, from which we can derive the difference in log odds for two categories, d and c, as

$$\ln\left[\frac{P(Y = d)}{P(Y = c)}\right] = \sum_{k} (b_{dk} - b_{ck}) X_k \tag{14.2}$$

Note that only the two categories being compared enter the equation. If, however, the relative odds do depend on what the alternatives are, the model produces misleading estimates. To see this clearly, consider McFadden's (1974) well-known example of transportation choice. Suppose people can travel to work by bus or by car and that half choose to go by car and half by bus. Now suppose a competing bus company establishes buses with the same routes and schedule, so we no longer have, say, only blue buses but also red buses. Presumably, the half that traveled by car would continue to do so, but the half that traveled by bus would divide equally between the red and blue buses, taking whichever bus showed up first at the bus stop. Thus the odds ratio for car versus blue-bus ridership would change from 1:1 to 2:1, violating the assumption of the model.

Now consider another example. Suppose there are two restaurants in a neighborhood, a Mexican and an Italian restaurant, and that the Mexican restaurant gets 60 percent of the total business. Then a new Chinese restaurant opens in the neighborhood and draws off 20 percent of the business of the Mexican restaurant and 20 percent of the business of the Italian restaurant. The Mexican restaurant's share of the total is now 48 percent, and the Italian restaurant's share of the total is 32 percent. Here the independence-of-irrelevant-alternatives (IIA) assumption holds because $60/40 = 48/32 = 3/2$.

Because the multinomial model is misleading when the IIA assumption is violated, McFadden suggests that multinomial (and conditional) logistic regression models should be estimated only when the outcome categories "can plausibly be assumed to be distinct and weighed independently in the eyes of each decision maker" (1974, 113).

A formal test of the IIA property is available, implemented in Stata 10.0 as -suest- ("seemingly unrelated estimation," a generalization of an earlier command, -hausman-). The -suest- test compares models that do and do not include presumably irrelevant outcomes. If the resulting parameters for the restricted and unrestricted models are similar, the additional outcomes can be assumed to be irrelevant. Applying these ideas to our current example, we might ask whether the odds that people speak English are affected by including "Russian" as an alternative in the model. In this case the test strongly suggests that the IIA condition is not satisfied. Thus we might consider estimating a sequential logit model in which we successively consider two questions: whether a respondent speaks either Russian or English versus speaking neither language, and for each of the two subsets of respondents (those speaking Russian and those speaking English) whether they speak the other language as well.

For further discussion of the IIA assumption and its consequences, see McFadden (1974), Hausman and McFadden (1984), Hoffman and Duncan (1988), Zhang and Hoffman (1993), Long (1997, 182-184), Powers and Xie (2000, 215-247), Long and Freese (2006), and the -hausman- and -suest- entries in StataCorp (2007). Additional examples of the application of multinomial logit models include AIl and Shields (1991), Haynes and Jacobs (1994), Tomaskovic-Devey and Skaggs (1999), and Breen and Jonsson (2000).
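The restaurant example can be reproduced with a minimal Python sketch of multinomial logit choice shares. Here each alternative's share is its exponentiated utility divided by the sum over all alternatives. The utilities are hypothetical values chosen so that the shares match the percentages in the text, and the function name is likewise an assumption made for illustration.

```python
import math


def choice_shares(utilities):
    """Multinomial logit shares: exp(u_j) divided by the sum of exp(u_k)."""
    denominator = sum(math.exp(u) for u in utilities.values())
    return {name: math.exp(u) / denominator for name, u in utilities.items()}


# Hypothetical utilities giving a 60/40 split between the two restaurants.
two_way = choice_shares({"Mexican": math.log(60), "Italian": math.log(40)})

# Add a third alternative; its utility is chosen so that it takes 20 percent.
three_way = choice_shares(
    {"Mexican": math.log(60), "Italian": math.log(40), "Chinese": math.log(25)}
)

# Under the model, the Mexican-to-Italian odds are the same in both settings.
print(two_way["Mexican"] / two_way["Italian"])
print(three_way["Mexican"] / three_way["Italian"])
print(three_way)
```

The ratio of any two shares depends only on those two utilities, because the common denominator cancels. That is exactly the IIA property, and it is why the model cannot represent the red-bus and blue-bus situation, in which the new alternative draws only from one of the existing ones.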


ORDINAL LOGISTIC REGRESSION

Often in the social sciences we have ordinal dependent variables, where the response categories can be ordered on some dimension but where the distance between categories is unknown. Most attitude variables are of this sort. For example, if people are asked to say how happy they are, and the response categories include "very happy," "pretty happy," and "not too happy," there is no ambiguity in assuming that those who say they are "pretty happy" are less happy than those who say they are "very happy" and are more happy than those who say they are "not too happy." However, there is no basis for assuming that the distance between "not too happy" and "pretty happy" is the same as the distance between "pretty happy" and "very happy." Many other attitude scales have similar properties. In such cases we could predict the scale score using ordinary least-squares regression. However, to do so would be tantamount to assuming that the distance between response categories is uniform. (For a fuller discussion of this and other points, see Winship and Mare [1984].)

An alternative is to estimate an ordinal logit equation, which makes use of the ordinal property of the response categories on the dependent variable but makes no assumption at all about the relative distances between categories. The basic assumption of the ordinal logit model is that there is an unobserved continuous dependent variable, $Y^*$, which is a linear function of a set of independent variables:

al

Db jx j + p

However, what is observed is a set of ordered categories, $Y = 1, \ldots, K$, such that

$$Y = 1 \text{ if } -\infty \le Y^* < k_1$$

$$Y = 2 \text{ if } k_1 \ldots$$