English Pages 392 [408] Year 2016
Mathematics for Social Scientists
SAGE was founded in 1965 by Sara Miller McCune to support the dissemination of usable knowledge by publishing innovative and high-quality research and teaching content. Today, we publish more than 850 journals, including those of more than 300 learned societies, more than 800 new books per year, and a growing range of library products including archives, data, case studies, reports, and video. SAGE remains majority-owned by our founder, and after Sara’s lifetime will become owned by a charitable trust that secures our continued independence.
Los Angeles | London | New Delhi | Singapore | Washington DC
Mathematics for Social Scientists Jonathan Kropko University of Virginia
FOR INFORMATION:
Copyright © 2016 by SAGE Publications, Inc.
SAGE Publications, Inc. 2455 Teller Road Thousand Oaks, California 91320 E-mail: [email protected]
All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.
SAGE Publications Ltd. 1 Oliver’s Yard 55 City Road London EC1Y 1SP United Kingdom SAGE Publications India Pvt. Ltd. B 1/I 1 Mohan Cooperative Industrial Area Mathura Road, New Delhi 110 044 India
Printed in the United States of America ISBN 978-1-5063-0421-2
SAGE Publications Asia-Pacific Pte. Ltd. 3 Church Street #10-04 Samsung Hub Singapore 049483
Acquisitions Editor: Helen Salmon Editorial Assistant: Anna Villarruel eLearning Editor: Katie Bierach Production Editor: Kelly DeRosa Copy Editor: QuADS Prepress Pvt Ltd Typesetter: QuADS Prepress Pvt Ltd Proofreader: Scott Oney Indexer: Jeanne Busemeyer Cover Designer: Anupama Krishnan Marketing Manager: Nicole Elliott
This book is printed on acid-free paper.
CONTENTS
Acknowledgments About the Author Introduction
I Algebra, Precalculus, and Probability 1. Algebra Review 1.1 Numbers 1.2 Fractions 1.2.1 Addition and Subtraction 1.2.2 Multiplication 1.2.3 Division 1.3 Exponents 1.4 Roots 1.5 Logarithms 1.6 Summations and Products 1.7 Solving Equations and Inequalities 1.7.1 Isolating a Variable 1.7.2 Distribution and Factoring 1.7.3 Solving Quadratic Equations 1.7.4 Solving Inequalities Exercises 2. Sets and Functions 2.1 Set Notation 2.2 Intervals 2.3 Venn Diagrams 2.4 Functions 2.4.1 Function Compositions and Inverses 2.4.2 Graphs 2.4.3 Domain and Range 2.5 Polynomials 2.5.1 Linear Functions and Linear Graphs 2.5.2 Higher-Order Polynomials 2.5.3 Linear Regression Exercises 3. Probability 3.1 Events and Sample Spaces
3.2 Properties of Probability Functions 3.2.1 Equally Likely Outcomes 3.2.2 Unions of Events 3.2.3 Independent Events 3.2.4 Complement Events 3.3 Counting Theory 3.3.1 Multiplication 3.3.2 Factorials 3.3.3 Combinations and Permutations 3.4 Sampling Problems 3.4.1 Sampling Without Replacement 3.4.2 Sampling With Replacement 3.5 Conditional Probability 3.6 Bayes’ Rule Exercises
II Calculus
4. Limits and Derivatives 4.1 What Is a Limit? 4.2 Continuity and Asymptotes 4.3 Solving Limits 4.4 The Number e 4.5 Point Estimates and Comparative Statics 4.6 Definitions of the Derivative 4.7 Notation 4.8 Shortcuts for Finding Derivatives 4.9 The Chain Rule Exercises 5. Optimization 5.1 Terminology 5.2 Finding Maxima and Minima 5.3 The Newton-Raphson Method Exercises 6. Integration 6.1 Informal Definitions of an Integral 6.2 Riemann Sums 6.3 Integral Notation 6.4 Solving Integrals 6.4.1 Solving Indefinite Integrals 6.4.2 Solving Definite Integrals 6.5 Advanced Techniques for Solving Integrals 6.5.1 u-Substitution 6.5.2 Integration by Parts 6.5.3 Improper Integrals
6.6 Probability Density Functions 6.7 Moments Exercises 7. Multivariate Calculus 7.1 Multivariate Functions 7.2 Multivariate Limits 7.3 Partial Derivatives 7.3.1 Definition and Notation 7.3.2 Gradients and Hessians 7.3.3 Optimization 7.3.4 Finding the Best-Fit Line for Linear Regression 7.3.5 Lagrange Multipliers 7.4 Multiple Integrals 7.4.1 Notation 7.4.2 Solving Multiple, Definite Integrals 7.4.3 Solving Multiple, Indefinite Integrals 7.4.4 Joint Probability Distributions and Moments Exercises
III Linear Algebra
8. Matrix Notation and Arithmetic 8.1 Matrix Notation 8.2 Types of Matrices 8.3 Matrix Arithmetic 8.3.1 Transpose 8.3.2 Trace 8.3.3 Addition and Subtraction 8.3.4 Scalar Multiplication 8.3.5 Kronecker Product 8.3.6 Vector Multiplication 8.4 Matrix Multiplication 8.4.1 Checking Conformability 8.4.2 Computing the Product 8.5 Geometric Representation of Vectors and Transformation Matrices 8.6 Elementary Row and Column Operations Exercises 9. Matrix Inverses, Singularity, and Rank 9.1 Inverse of a (2 × 2) Matrix 9.2 Inverse of a Larger Square Matrix 9.2.1 The Adjoint Matrix 9.2.2 Determinants 9.3 Multiple Regression and the Ordinary Least Squares Estimator 9.4 Singularity, Rank, and Linear Dependency 9.4.1 Singularity
9.4.2 Linear Dependency 9.4.3 Rank Exercises 10. Linear Systems of Equations and Eigenvalues 10.1 Nonsingular Coefficient Matrices 10.1.1 Solving by Taking a Matrix Inverse 10.1.2 Solving by Using Elementary Row Operations 10.2 Singular Coefficient Matrices 10.2.1 Systems With No Solution 10.2.2 Systems With Infinitely Many Solutions 10.3 Homogeneous Systems 10.4 Eigenvalues and Eigenvectors 10.4.1 Finding Eigenvalues 10.4.2 Positive-Definite and Negative-Definite Matrices 10.4.3 Finding Eigenvectors 10.5 Statistical Measurement Models 10.5.1 Principal Components Analysis 10.5.2 Correspondence Analysis Exercises
Conclusion: Taking the Math With You As You Proceed Through Your Program
Index
Acknowledgments
This work is a direct result of years of teaching the mathematics review course to first-year graduate students in the political science department at the University of North Carolina. I thank the faculty of the department, especially Stephen Gent and Tom Carsey, for the opportunity to develop this course in my own way. I thank the students in those early courses for being wonderful guinea pigs. This book would not have been possible if it were not for the encouragement of my adviser George Rabinowitz, who tolerated the distraction from my dissertation because he believed as I do in the importance of mathematics for the social sciences. I also thank my colleagues in the politics department at the University of Virginia, who have been very supportive of me as I develop methods seminars and put our graduate students through the mathematical wringer. The wonderful people at SAGE, especially Helen Salmon, Anna Villarruel, and Kelly DeRosa, have shown nothing but enthusiasm for this project. I thank them for their hard work, their support, and their patience. Sucheta Soundarajan has been helping me with my math homework for the last 14 years, and this book is a whole lot better because of her thorough and critical read-through. Ryan North, Zach Weinersmith, and Randall Munroe are three of my favorite comic artists, and I included their comics in this text long before I thought I would publish it, because it was so cool. I thank them for the permission to republish their comics, and I strongly suggest that all readers proceed immediately to Dinosaur Comics (www.qwantz.com), Saturday Morning Breakfast Cereal (www.smbc-comics.com), and XKCD (www.xkcd.com), and start at Comic #1. These comics helped get me through graduate school, and they can do the same for you. Finally, I thank Dan Powers, University of Texas at Austin; Jay Verkuilen, The City University of New York; and the other anonymous SAGE reviewers for providing invaluable feedback on this work.
About the Author
Jonathan Kropko is an assistant professor in the Department of Politics at the University of Virginia, where he also serves on the steering committee of the Quantitative Collaborative, an interdisciplinary research initiative for applied statistics in the social sciences. Previously, he held a postdoctoral research fellowship at the Applied Statistics Center at Columbia University and was a statistics consultant at the H. W. Odum Institute for Research in the Social Sciences at the University of North Carolina. He holds degrees in mathematics (BS) and political science (BA) from Ohio State University, and earned a PhD in political science from the University of North Carolina in 2011. He is a specialist in political methodology, with a focus on missing data imputation, time series, and measurement methods.
Introduction
“My program is neither math nor a hard science, so why in the world would I want to take a math class?”
“I’m just no good at math.” “I hate math.” Sound familiar? You are not alone. Almost all students encounter a frustrating math class at some point in their academic career that shakes their confidence and convinces them that the subject is both useless and horrible. One issue is that math is a cumulative subject. Each course begins with the assumption that all of the students have mastered all the material from previous courses, and, of course, that is rarely true. That means that the first time you struggled in a class, for whatever reason, you were in danger of being left behind. Even if you generally earned good grades in high school and college math, chances are that you still harbor anxieties about the subject and its applications in the social sciences. I am going to ask you to make the brave decision to reconsider your opinion about math. My goal in this text is to convince you that math provides a very useful way to think about social science, that math is a language that allows researchers to make very precise statements about theories, and that math is both beautiful and fun. No one will be left behind: We will begin by discussing some foundational ideas, and by the end of the text, we will be working on topics that are well beyond a college calculus sequence. Quantitative methods are a tool—among many tools—for conducting high-quality research in the social sciences, and math is useful because it underlies all of the methods that are used for quantitative research. Knowing math puts you in control when you use quantitative methods. Instead of relying on a computer program with preprogrammed commands, math will give you the insight to understand what these commands do and what information they actually provide. You will be able to alter these commands and create new techniques to meet the particular needs of your research project. 
When something goes wrong with the quantitative methods you’re using for a research project (and something definitely will), you will have an idea about how to fix the problem.

Math, like a language, is used to express ideas. So, in a way, learning math is similar to learning a foreign language, and as a language, math is ideally suited to express ideas with clarity. If you can translate a theory into math, then you will be able to test whether that theory is really supported by our observations of the real world. You will also be able to use logical deduction to derive new theories that are implied by the one you specify.

Math is a concept that was discovered, not invented. And the scientists who discovered that the shapes of seashells follow the same mathematical laws as the alignment of the planets in the solar system understood that math is beautiful as well as precise and useful. One thing that math is not is easy. Math can be very challenging, but in my opinion, solving a challenging math problem is incredibly fun precisely because it isn’t easy. That said, anyone can solve math problems: All that is required is patience and an open mind.

This text was written for an audience of students entering graduate programs in the social sciences, although I hope that other audiences will find it illustrative and useful as well. At many universities, undergraduate majors such as political science, sociology, and even psychology require no more than one quantitatively oriented class, which may or may not be math. Given the stigma of math, it’s not surprising that many new social science graduate students haven’t had a math class since high school. The amount of math needed to conduct quantitative research is one of the biggest differences between social science at the undergraduate level and social science at the level of professional academic research, so many academic departments require their new graduate students to participate in a math course during their first semester or during the preceding summer. Students can be alienated and confused by this emphasis.

To teach math effectively to social science grad students, the material has to be presented very clearly, and the concepts have to be applied to important techniques in social science research. This text attempts to achieve more clarity by striking a conversational tone, by avoiding the abstracted language of mathematical proofs, and by providing lots of examples. Application, in mathematics, is a tricky word: Texts in applied mathematics are often chiefly concerned with physics and engineering. Application in high school–level texts often refers to obnoxious word problems (Joe can mow the lawn in 1.5 hours, and Steve can mow the lawn in 3 hours. How long will it take them to mow the lawn if they work together?).
In this text, the math is applied to topics students will encounter in later courses in quantitative methodology. Many of these applications are built into the exercises at the end of each chapter, so that students can draw the connections between math and advanced methodology while they work through the problems.

Given the choice between an intensive 2- or 3-week summer math course for new graduate students and a 1-hour-per-week seminar over the course of one or two semesters, I strongly recommend the latter. It’s simply not possible for a student to learn much math from a lecture, no matter how eloquent the lecturer may be. Math must be performed to be learned. The real education occurs after class, as a group of students meet in an empty classroom, pondering a difficult problem, writing on a whiteboard, and explaining concepts to one another. Students must be given enough time to think through the problems. Below I list my recommended schedule for a 15-week semester class that meets once a week for an hour:
• Week 1: Types of numbers, fractions, exponents, roots (Sections 1.1–1.4)
• Week 2: Logarithms, summations and products, solving equations and inequalities (Sections 1.5–1.7)
• Week 3: Set notation, intervals, functions, polynomials (Chapter 2)
• Week 4: Events and sample spaces, probability functions, counting theory, sampling problems, conditional probability (Sections 3.1–3.5)
• Week 5: Bayes’ rule (Section 3.6)
• Week 6: Derivatives (Sections 4.5–4.9)
• Week 7: Optimization (Chapter 5)
• Week 8: Riemann sums, integral notation, integral definition, solving integrals (Sections 6.1–6.4.2)
• Week 9: Advanced techniques for solving integrals, probability density functions, moments (Sections 6.5–6.7)
• Week 10: Multivariate functions, derivatives, and optimization (Sections 7.1, 7.3–7.3.3)
• Week 11: Multivariate integrals, joint probability density functions and moments (Section 7.4)
• Week 12: Multivariate calculus continued, matrix notation, terminology, and arithmetic (Chapter 8)
• Week 13: Matrix inverses (Sections 9.1–9.2)
• Week 14: Singularity, rank, linear dependence, solving linear systems of equations (Sections 9.4–10.3)
• Week 15: Eigenvalues, eigenvectors, statistical measurement models (Sections 10.4–10.5)

A companion website at study.sagepub.com/kropko features a solutions guide with the answers to the end-of-chapter exercises so that you can check your understanding as you become familiar with each concept.
This book is dedicated to my Mom and Dad
PART I Algebra, Precalculus, and Probability
Algebraic symbols are used when you do not know what you are talking about. —Philippe Shnoebelen
As far as the laws of mathematics refer to reality, they are not certain, and as far as they are certain, they do not refer to reality. —Albert Einstein
1
Algebra Review
1.1 Numbers
Mathematics is first and foremost the study of how to use numbers. Therefore, it is important to know what kinds of numbers researchers use and how these sets of numbers are referred to and denoted. Below is a list of the most common sets of numbers, with their common letter labels and a definition of each set:
N, the natural numbers: the counting numbers starting at 1 and going up, {1, 2, 3, . . .}.
ω, the whole numbers: the natural numbers together with 0,¹ {0, 1, 2, 3, . . .}.
Z, the integers: the natural numbers, their negatives, and zero, {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}.
Q, the rational numbers: any number that can be written as a fraction of integers. , . , and . (which equals ) are all examples of rational numbers. Any decimal with only finitely many digits must be a rational number. Some decimals with infinitely many digits are also rational numbers, for example, . ... = .
Q′, the irrational numbers: any number that cannot be expressed as a fraction of integers. These numbers must have decimals with infinitely many digits and no repeating pattern within the digits. An example of an irrational number is π = 3.14159 . . .
R, the real numbers: any number that is either rational or irrational.² In set notation, we say that the real numbers are the union of the rational and irrational numbers: Q ∪ Q′.
R+ and R−, the positive real numbers and the negative real numbers, respectively.
¹ Mathematicians don’t always define the whole numbers separately in this way; many just refer to this set of numbers as the “nonnegative integers” or define the natural numbers to include 0.
² The numbers that people use in day-to-day life are real numbers. But there are numbers in use that are not real numbers. ∞ is an example of a nonreal number that is discussed here. There are other nonreal numbers that have many applications in fields such as physics and engineering, but this topic is beyond the scope of this book.
An integer is called an even number if it can be divided by 2 and produce a quotient that is still an integer. An integer is called an odd number if adding 1 and dividing by 2 produces an integer. No number is both odd and even. We can use the four fundamental arithmetic operations on real numbers. The four operations are addition, subtraction, multiplication, and division. For simple arithmetic problems that involve several operations, the order in which the operations are performed can change the answer. Mathematics uses a set of conventions, called the order of operations, to standardize the order in which the operations should be performed. Operations in parentheses are performed first, then exponents³ (discussed in detail in Section 1.3), then multiplication and division, and then addition and subtraction. You may be acquainted with the following phrase, abbreviated PEMDAS, which can help you remember the order of operations:
“Please Excuse My Dear Aunt Sally”
P: parentheses, E: exponents, M: multiplication and D: division, A: addition and S: subtraction.
Figure 1.1
An Alternative Mnemonic Device for the Order of Operations (Excerpt from XKCD comics by Randall Munroe, #992)
Source: xkcd.com/992
³ And also logarithms, which are presented in Section 1.5.
Multiplication and division can be performed in either order without changing the answer, and addition and subtraction can be reversed without consequence, which is why these operations are listed together in the box above. The order of operations may seem like an easy topic, but people who fail to take the order of operations seriously can be easily misled. Example 1. Solve × ( + ÷ ) ÷
− .
If we were to enter the expression into a nonscientific calculator exactly as written, even if the calculator recognizes to work with parentheses first, we get −0.75, which is the wrong answer. Instead, we use the order of operations, and work by hand. First, we work inside the parentheses, and we must compute division inside the parentheses before addition. So we divide ÷ = , and we reduce the expression to ×( + )÷ − . Now we compute the addition within the parentheses and get × ÷
− .
Now that the parentheses are removed, we can perform either the multiplication or the division and stay on track, but we cannot yet perform the subtraction. Multiplying, we get ÷ and dividing, we obtain
− , − .
Finally, we subtract and obtain the correct answer of 0.
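Most programming languages follow the same order of operations, which makes them a convenient way to check work done by hand. The sketch below uses Python and an illustrative expression chosen for this purpose (the numbers are assumptions, not the ones from Example 1). It compares the conventional evaluation with a naive, strictly left-to-right evaluation that ignores the conventions.

```python
# Illustrative expression (values chosen for this sketch):
#   2 x (3 + 6 / 2) / 4 - 3

# Python applies the standard order of operations: parentheses first,
# division before addition inside them, then x and / from left to right,
# and subtraction last.
correct = 2 * (3 + 6 / 2) / 4 - 3

# A naive evaluation that processes every operation strictly left to right,
# ignoring both the parentheses and the precedence rules.
naive = ((((2 * 3) + 6) / 2) / 4) - 3

print(correct)  # 0.0
print(naive)    # -1.5
```

The two results differ, which is the point of the example: the same symbols give different answers when the conventions are not followed.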
Integers can be broken up into parts. Addends are numbers that add to the number we are considering. So and are addends of , and so are and − . Any integer has infinitely many addends. Usually, it is much more important to consider the factors of an integer. Factors are integers that multiply to the number we are considering, and any integer has only finitely many factors. A prime number is an integer that has only two factors, itself and 1. 2, 3, 5, 7, 11, 13, . . . are all prime. is not a prime number since × = . Integers that are not prime numbers are called composite numbers. 2 is the only even prime number because its only factors are itself and 1, and all other even numbers have 2 as a factor. There are infinitely many prime numbers. Any number can be broken down into smaller and smaller factors until all the factors are prime. For example, =( ×
)
= ×( ×
)
= × ×( × ) = × × × ( × ). This string of prime numbers multiplied together is called the prime factorization of a number. In this case, × × × × is the prime factorization of . One of the most interesting proofs in basic number theory is that every integer has a unique prime factorization. That is, every integer has a prime factorization, and no two integers share the same prime factorization.
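The process of breaking a number into smaller and smaller factors can be sketched in code. The function below is a minimal trial-division sketch, not an efficient algorithm; the name `prime_factors` and the test values are chosen here for illustration.

```python
def prime_factors(n):
    """Return the prime factorization of an integer n >= 2 as a sorted list."""
    if n < 2:
        raise ValueError("this sketch handles integers greater than or equal to 2")
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        # Pull out this divisor as many times as it divides n evenly.
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        # Whatever remains has no smaller factor, so it is itself prime.
        factors.append(n)
    return factors


print(prime_factors(360))  # [2, 2, 2, 3, 3, 5]
print(prime_factors(97))   # [97]
```

Multiplying the returned list back together recovers the original number, and a prime input returns a list containing only itself.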
We use symbols to represent numbers much of the time. Very often, these symbols are English letters, such as x, y, and z, and Greek letters, such as α, β, and γ. We use English and Greek letters only as placeholders, and changing the letter does not alter the quantity that the letter represents. Sometimes, however, researchers develop conventions about the use of particular letters in particular settings. x and y are commonly used in algebraic equations, and in the social sciences i often refers to a particular observation in a survey, t refers to a particular point in time, and β refers to a coefficient from a regression analysis. There are two kinds of quantities that can be represented symbolically: constants and variables. Constants are quantities that do not change. Let ζ be the number of months in a year. Then ζ = 12, and it does not change, so ζ is a constant. π = 3.14159 . . . is the ratio of the circumference to the diameter of any circle. It is an irrational number, but it is constant. Variables can change. If η is the number of rainy days in a year, then η is a variable because it will change from year to year. A reciprocal of a number is simply the number whose fraction is flipped. Fractions are discussed in more detail in Section 1.2. The reciprocal of is . Any number can be written as a fraction with 1 in the denominator since division by 1 does not change the number. So the reciprocal of any nonzero constant c is $\frac{1}{c}$. We usually do not try to take the reciprocal of variables because division by 0 is undefined. In other words, if x might be equal to 0, then $\frac{1}{x}$ might be undefined.
1.2 Fractions Fractions are ways to express division. So to write ÷ =. is the same as writing =. . In the above fraction, the number on top, , is called the numerator and the number on the bottom, , is called the denominator. When the numerator and the denominator share no factors, the fraction is said to be in lowest terms. In general, the simplest way to write a fraction is in lowest terms. To write a fraction in lowest terms, simply divide the top and bottom of the fraction by every common factor greater than 1. Recall that division by 0 is undefined; we can divide something into three parts, or even just one part, but to divide something into zero parts makes no sense. So the denominator of a fraction cannot be 0. Arithmetic becomes a whole lot more complicated when we consider fractions. But algebraic equations show no mercy when it comes to involved fractional arithmetic. You will be expected to be instantly familiar with all the techniques for performing arithmetic on fractions, so let’s review these techniques.
1.2.1 Addition and Subtraction
There are two cases of fractional addition and subtraction to consider. The easy case occurs when the two fractions to be added or subtracted have the same denominator. Let a, b, c, and d be constants where $c \neq 0$ and $d \neq 0$. To add (or subtract) two fractions with the same denominator, simply add (or subtract) the numerators and leave the denominator alone:
$$\frac{a}{c} + \frac{b}{c} = \frac{a+b}{c}.$$
Addition and subtraction become much trickier when the two fractions have different denominators. In fact, we can’t do anything until we change the fractions around so that they do have the same denominator. Recall that a number divided by itself is equal to 1. By the same logic, a fraction with the same number in the numerator and denominator is equal to 1. Therefore, for any fraction, we can multiply or divide the top and bottom by the same thing without changing the value of the fraction. To add two fractions with different denominators, we have to find the lowest common denominator (LCD) of the two fractions. The LCD is the lowest number that has both denominators as factors.
Example 1. LCD( , ) = since and are both factors of and there is no number lower than that has both and as factors.
For two denominators c and d, the product c × d is always a common denominator, but it isn’t necessarily the lowest common denominator. For and , is a common denominator, but also works and is lower. Sometimes, though, we can do no better than the product. LCD( , ) = × = since there is no number less than for which and are both factors. To see this, try listing multiples of that are less than . We can see that is not a factor of , , , , , or . When the denominators are variables or are constants represented by symbols, we always use the product of the two denominators as the LCD because we have no way to find a lower common denominator. Again, finding the LCD is the first step in adding two fractions with different denominators. To see all the necessary steps, consider the following example: Example 2. Evaluate the sum + .
• Step 1. Find the LCD. Here LCD( , ) = . • Step 2. For each fraction, multiply the numerator and denominator by the number that makes the denominator equal to the LCD. For the first fraction, that number is 3 since × and for the second fraction that number is 2. So, changing the fractions, we obtain × ×
+
× ×
=
+
=
,
.
• Step 3. Add the numerators, and leave the common denominator alone: +
=
.
The fraction has a numerator that is bigger than its denominator, so it is greater than 1. Such fractions are called improper fractions. We could rewrite the improper fraction as a mixed number by observing that = + = .
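The three steps for adding fractions with different denominators can be sketched in Python. The function name `add_fractions` and the example values (1/4 + 5/6) are chosen here for illustration and are not taken from the example above; the sketch assumes positive denominators and does not reduce the result to lowest terms.

```python
from fractions import Fraction
from math import gcd


def add_fractions(n1, d1, n2, d2):
    """Add n1/d1 + n2/d2 and return (numerator, common denominator)."""
    # Step 1: the LCD is the least common multiple of the two denominators.
    lcd = d1 * d2 // gcd(d1, d2)
    # Step 2: scale each numerator by the number that turns its
    # denominator into the LCD.
    scaled1 = n1 * (lcd // d1)
    scaled2 = n2 * (lcd // d2)
    # Step 3: add the numerators and leave the common denominator alone.
    return scaled1 + scaled2, lcd


num, den = add_fractions(1, 4, 5, 6)  # illustrative values: 1/4 + 5/6
print(num, den)                       # 13 12

# An improper fraction can be rewritten as a mixed number.
whole, remainder = divmod(num, den)
print(whole, remainder, den)          # 1 1 12, that is, 1 and 1/12

# Cross-check against Python's built-in exact fraction type.
print(Fraction(1, 4) + Fraction(5, 6) == Fraction(num, den))  # True
```

Here the LCD is 12 rather than the product 24, so the first fraction is scaled by 3 and the second by 2.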
So to apply these steps to a symbolic case, consider another example.
Example 3. Evaluate the sum $\frac{a}{c} + \frac{b}{d}$.
• Step 1. Find the LCD. Here LCD(c, d) = cd since we are dealing with symbols.
• Step 2. For the first fraction, we multiply the top and bottom by d, and for the second fraction we multiply the top and bottom by c. So, changing the fractions, we obtain
$$\frac{a \times d}{c \times d} + \frac{b \times c}{d \times c} = \frac{ad}{cd} + \frac{bc}{cd}.$$
• Step 3. Add the numerators, and leave the common denominator alone:
$$\frac{ad}{cd} + \frac{bc}{cd} = \frac{ad + bc}{cd}.$$
An important note of caution: We can cancel a term if it is a factor in both the numerator and the denominator of a fraction but not if it is an addend. In other words,
This is a true statement: $\frac{ab}{a} = b$.
This is a false statement: $\frac{a+b}{a} = b$.
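A quick numerical check can make the difference between canceling a factor and canceling an addend concrete. The values of `a` and `b` below are arbitrary choices for this sketch.

```python
from fractions import Fraction

a, b = 3, 5  # illustrative nonzero values

# A common factor cancels: (a * b) / a equals b.
print(Fraction(a * b, a) == b)                   # True

# A common addend does not cancel: (a + b) / a is 1 + b/a, not b.
print(Fraction(a + b, a) == b)                   # False
print(Fraction(a + b, a) == 1 + Fraction(b, a))  # True
```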
1.2.2 Multiplication
Fraction multiplication is easier than fraction addition. In general, simply multiply the two numerators together to obtain the new numerator and multiply the two denominators together to obtain the new denominator. Symbolically,
$$\frac{a}{c} \times \frac{b}{d} = \frac{ab}{cd}.$$
Example 1. Multiply the following fractions: $\frac{1}{2} \times \frac{2}{3}$. We multiply the numerators, $1 \times 2 = 2$, and we multiply the denominators, $2 \times 3 = 6$. Therefore, the product is $\frac{2}{6}$. But notice that the numerator and denominator share a factor of 2. So to write this fraction in lowest terms, we divide the top and bottom by 2, leaving $\frac{1}{3}$. Another way to think about this problem is that “half of two thirds is one third.”
One thing to look out for is an opportunity to cross-cancel these fractions. If the numerator of one fraction and the denominator in the other share a factor, that factor will cancel out in the product. Life will be easier for you if you cancel out this number before multiplying. Example 2. Multiply the following fractions: ×
.
and share a factor of , and and share a common factor of . After dividing by the common factors, we have and , and and , respectively. After cross-cancellation, ×
×
=
=
.
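Cross-cancellation can be sketched in Python using the greatest common divisor to find each shared factor. The function name `multiply_cross_cancel` and the example values (3/8 × 4/9) are assumptions made for this sketch, not the fractions from the example above.

```python
from fractions import Fraction
from math import gcd


def multiply_cross_cancel(n1, d1, n2, d2):
    """Multiply n1/d1 by n2/d2, canceling shared factors before multiplying."""
    # Cancel the factor shared by the first numerator and the second denominator.
    g = gcd(n1, d2)
    n1, d2 = n1 // g, d2 // g
    # Cancel the factor shared by the second numerator and the first denominator.
    g = gcd(n2, d1)
    n2, d1 = n2 // g, d1 // g
    # Multiply what is left: numerators together, denominators together.
    return n1 * n2, d1 * d2


result = multiply_cross_cancel(3, 8, 4, 9)  # illustrative values: 3/8 x 4/9
print(result)                               # (1, 6)

# Cross-check against Python's built-in exact fraction type.
print(Fraction(3, 8) * Fraction(4, 9) == Fraction(*result))  # True
```

When each input fraction is already in lowest terms, canceling first leaves a product that needs no further reduction, which is why the shortcut saves work.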
1.2.3 Division
Fraction division looks tricky, but it is just fraction multiplication with one extra step. Recall that dividing by a number is equivalent to multiplying by the reciprocal of that number. Therefore,
$$\frac{\;\frac{a}{b}\;}{\;\frac{c}{d}\;} = \frac{a}{b} \times \frac{d}{c} = \frac{ad}{bc},$$
where $b, c, d \neq 0$. Simply take the reciprocal of the bottom fraction, and multiply as above, cross-canceling when you can.
Example 1. Divide the following fractions: . We rewrite the problem as a multiplication problem by multiplying the top fraction by the reciprocal of the bottom fraction: × . The 10 in the denominator of the first fraction and the 5 in the numerator of the second fraction share a common factor of 5, which we remove by cross-canceling; then we multiply the numerators and denominators together: × = .
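The rule of multiplying by the reciprocal can be sketched as follows. The function name `divide_fractions` and the example values, (3/10) ÷ (2/5), are chosen here for illustration.

```python
from fractions import Fraction


def divide_fractions(a, b, c, d):
    """Compute (a/b) / (c/d) by multiplying a/b by the reciprocal d/c."""
    if b == 0 or c == 0 or d == 0:
        raise ZeroDivisionError("b, c, and d must all be nonzero")
    return Fraction(a, b) * Fraction(d, c)


quotient = divide_fractions(3, 10, 2, 5)  # illustrative values: (3/10) / (2/5)
print(quotient)                           # 3/4

# Cross-check using direct division of exact fractions.
print(Fraction(3, 10) / Fraction(2, 5) == quotient)  # True
```

By hand, the 10 and the 5 share a factor of 5, so the product reduces to 3/2 × 1/2 = 3/4, matching the output.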
Example 2. This example uses exponents (discussed in Section 1.3) and factoring (discussed in Section 1.7). If this example seems too complicated at this point, return to it after reading those sections. Simplify x+ x . (x + )(x − ) x
We reciprocate, cross-cancel, and multiply: x+ x x+ x = (x + )(x − ) x (x + )(x − ) x =
xx−
=
x(x − )
.
1.3 Exponents

An exponent is shorthand for repeated multiplication. The exponent refers to the number of times a number is multiplied by itself. So = × × = . Expressions with exponents have a particular anatomy. Consider an expression such as $x^a$, where a and x can be either constants or variables (the same math applies in either case). a is the exponent, and x is the base of the expression. We can say that x is taken to the ath power.

There are a number of rules to easily perform algebra on exponents. Familiarizing yourself with these rules is important for solving equations with exponents quickly and correctly. Let a and b be real numbers in the expressions below. x can be a number or a variable.

1. Multiplication of two numbers with the same base and different exponents. Add the exponents together:
\[ x^a x^b = x^{a+b}. \]

2. Division of two numbers with the same base and different exponents. Subtract the bottom exponent from the top exponent:
\[ \frac{x^a}{x^b} = x^{a-b}. \]

3. Negative exponents. Take the reciprocal, but change the negative exponent to a positive one:
\[ x^{-a} = \frac{1}{x^a}. \]

4. Exponents of 0. $0^0$ is undefined, but taking any other number to the zero power produces 1:
\[ x^0 = 1 \text{ for any } x \neq 0. \]

5. Exponents of 1. Taking the first power of any number returns the number:
\[ x^1 = x \text{ for all } x. \]

6. Multiple layers of exponents. Multiply the different layers together:
\[ (x^a)^b = x^{ab}. \]
7. Multiplication of numbers with different bases and the same exponent. The product of the different bases can be set to the common exponent:
\[ x^a y^a = (xy)^a. \]

8. Division of numbers with different bases and the same exponent. The fraction of the different bases can be set to the common exponent:
\[ \frac{x^a}{y^a} = \left(\frac{x}{y}\right)^a. \]

To provide some intuition behind these rules, consider some examples.

Example 1. Consider the product of $x^4$ and $x^3$:
\[ (x^4)(x^3) = (xxxx)(xxx) = xxxxxxx = x^7. \]

Example 2. Simplify $\frac{x^7}{x^3}$. If we write out the exponents, we can begin canceling out individual xs:
\[ \frac{x^7}{x^3} = \frac{xxxxxxx}{xxx} = xxxx = x^4. \]

Example 3. Consider the cube of $x^4$:
\[ (x^4)^3 = x^4 x^4 x^4 = (xxxx)(xxxx)(xxxx) = xxxxxxxxxxxx = x^{12}. \]

Example 4. Consider the product of $x^4$ and $y^4$:
\[ x^4 y^4 = (xxxx)(yyyy) = xxxxyyyy = (xy)(xy)(xy)(xy) = (xy)^4. \]
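The eight exponent rules above can also be spot-checked numerically. The following is a minimal sketch, not part of the original text: the function name `check_exponent_rules` and the sample values are illustrative assumptions, and exact fractions are used so that no rounding error hides a failed rule.

```python
from fractions import Fraction


def check_exponent_rules(x, y, a, b):
    """Return True if all eight exponent rules hold exactly.

    x and y are nonzero bases; a and b are integer exponents
    (integers keep the arithmetic exact with Fraction).
    """
    x, y = Fraction(x), Fraction(y)
    return all([
        x**a * x**b == x**(a + b),    # rule 1: same base, add exponents
        x**a / x**b == x**(a - b),    # rule 2: same base, subtract exponents
        x**(-a) == 1 / x**a,          # rule 3: negative exponent is a reciprocal
        x**0 == 1,                    # rule 4: zero power (x is nonzero)
        x**1 == x,                    # rule 5: first power
        (x**a)**b == x**(a * b),      # rule 6: layered exponents multiply
        x**a * y**a == (x * y)**a,    # rule 7: same exponent, multiply bases
        x**a / y**a == (x / y)**a,    # rule 8: same exponent, divide bases
    ])


print(check_exponent_rules(3, 5, 4, 3))
```

A check like this only confirms the rules for the values tried; the written-out examples above are what explain why the rules hold.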
1.4 Roots

Anything that can be done in mathematics can be undone. When two operations cancel each other out, they are called inverse operations. Two operations are inverse when performing both operations is the same as performing neither operation. In other words, if we start with c, and then perform two inverse operations, we should be left with c. Addition is canceled by subtraction. Multiplication is canceled by division. Constant exponents are canceled out by roots. The square root of a number c is denoted as $\sqrt{c}$. Here, notice that $\sqrt{c^2} = c$ and $(\sqrt{c})^2 = c$ (for $c \geq 0$), so squaring a number and taking the square root of a number are inverse operations. The cube root of a number c, denoted $\sqrt[3]{c}$, cancels out the cube of a number. We define the nth root of a number as the root that cancels out an exponent n. Roots are also sometimes called radicals. We denote the nth root of a number c as $\sqrt[n]{c}$. Then, for $n \neq 0$,
\[ \sqrt[n]{c^n} = c \quad \text{and} \quad (\sqrt[n]{c})^n = c. \]
Roots can really be thought of as divided exponents. In other words, we can always rewrite a root in the following way:
\[ \sqrt[n]{c} = c^{1/n}. \]
By rewriting a root in this way, most of the rules outlined above for exponents also hold for roots.

1. \[ \sqrt[a]{x}\,\sqrt[b]{x} = x^{1/a} x^{1/b} = x^{\frac{1}{a} + \frac{1}{b}} = x^{\frac{a+b}{ab}} = (x^{a+b})^{\frac{1}{ab}} = \sqrt[ab]{x^{a+b}}. \]

2. \[ \frac{\sqrt[a]{x}}{\sqrt[b]{x}} = \frac{x^{1/a}}{x^{1/b}} = x^{\frac{1}{a} - \frac{1}{b}} = x^{\frac{b-a}{ab}} = (x^{b-a})^{\frac{1}{ab}} = \sqrt[ab]{x^{b-a}}. \]

3. \[ \sqrt[-a]{x} = x^{-1/a} = \frac{1}{x^{1/a}} = \frac{1}{\sqrt[a]{x}}. \]

4. $\sqrt[1]{x} = x$ for all x, so we never consider these first roots.

5. \[ \sqrt[b]{\sqrt[a]{x}} = (x^{1/a})^{1/b} = x^{\frac{1}{ab}} = \sqrt[ab]{x}. \]

6. \[ \sqrt[a]{x}\,\sqrt[a]{y} = x^{1/a} y^{1/a} = (xy)^{1/a} = \sqrt[a]{xy}. \]
The most important rule is this last one. This rule states that roots can be broken up across multiplication. So = × = = × = . Roots can be simplified by breaking the number inside the root down into its lowest factored form. Recall from Section 1.1 that any number can be broken down into a string of prime numbers multiplied together. For example, = × = × × = × × × = × × × × = × × × × . For a square root, if two factors inside the root are the same, they produce a square that can be brought outside the square root. So for our example, √ √ = × × × × √ × × = √ = × × √ = .
Here, √ is called the reduced radical form of √ . Writing radicals in this way sometimes helps clear up complicated algebraic expressions. If we had to consider , we could rewrite it as = , which is much easier to deal with.

The “Jailbreak Method”: Here’s an analogy I often use to explain radical simplification. A root is like a prison, and the prime factors are like inmates planning a jailbreak. But, as any good jailbreak movie would indicate, an inmate cannot break out of jail alone; he needs a partner. For a square root, two like factors team up to break out of jail, but as they run for the fence, one of them is spotted by the guard tower and shot. The other one escapes. For a cube root, three factors have to team up, and only one escapes. For an nth root, n factors team up, and still only one survives and escapes.
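The jailbreak analogy translates directly into a short procedure for square roots: factor the number into primes, let one factor escape for every matching pair, and leave any unpaired factor inside. The sketch below is an illustration under stated assumptions; the function name `simplify_sqrt` and the test value 72 are hypothetical and do not come from the text.

```python
def simplify_sqrt(n):
    """Return (outside, inside) so that sqrt(n) equals outside * sqrt(inside).

    n must be a positive integer. Each pair of equal prime factors
    lets one copy "escape" to the outside of the radical.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    outside, inside = 1, 1
    factor = 2
    while factor * factor <= n:
        count = 0
        while n % factor == 0:       # strip out every copy of this factor
            n //= factor
            count += 1
        outside *= factor ** (count // 2)   # one escapee per pair
        inside *= factor ** (count % 2)     # an unpaired copy stays inside
        factor += 1
    inside *= n   # what remains is a single prime, or 1
    return outside, inside


print(simplify_sqrt(72))   # (6, 2), because 72 = 2*2*2*3*3
```

For a cube root the same idea would use groups of three rather than pairs, which is why the analogy speaks of n factors teaming up for an nth root.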
There is an important difference between nth roots where n is an even number and nth roots where n is an odd number. nth roots are undefined for negative numbers when n is even, but they are defined for negative numbers when n is odd. The most important rule to remember here is that the square root (or any even nth root) of a negative number is not a real number.4 Why can we take an odd root of a negative number, but not an even root? Let y be the square root of x, so that $\sqrt{x} = y$. By the definition of a square root, y should be the number such that when we square it we get x, so $y^2 = x$. $y^2$ is always greater than or equal to 0, so x must always be greater than or equal to 0. As a consequence, $\sqrt{x}$ does not exist if x is negative. More generally, whenever n is an even integer, $y^n \geq 0$, so $\sqrt[n]{x}$ does not exist if x is negative. But when we consider odd roots, negative values can make sense because odd exponents preserve the sign of a number. For example, − = − × − × − = − , so √− = − .

There is one more confusing point about roots that should be addressed. Consider finding √ . There are two solutions here. Clearly is one of the answers since × = . But − is also an answer since − × − = . Any even root of a positive number really has two solutions, not one. But try evaluating √ on any calculator; you will see that the calculator provides as the answer and not − . Is the calculator wrong? Not exactly; the calculator was programmed to ignore − as a valid solution. Mathematical functions are useful in that they provide one and only one answer to a problem and avoid ambiguity. So when we consider the square root of a number, we only consider the positive answer and ignore the negative answer, not because it is any less valid but because it is more convenient to obtain one and only one answer.

4 Sometimes engineers, physicists, and other scientists find it useful to consider the square root of a negative number. But such a number is not a real number, so they use a concept called an imaginary number. The imaginary number is defined as $\sqrt{-1}$, which they denote as i. i is one of the fundamental constants of mathematics, like π. The square root of any negative number can be described with i: for example, √− = √− × = √− √ = i.
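Software follows the same conventions as the calculator described above. As a hedged illustration, the sketch below uses Python's standard `math` and `cmath` modules; the helper `real_root` and the sample values are assumptions made for this example only. It returns the single nonnegative answer for even roots, refuses even roots of negative numbers, and allows odd roots of negative numbers.

```python
import cmath
import math


def real_root(x, n):
    """Return the real nth root of x for a positive integer n.

    Even roots of negative numbers are not real, so they raise an error.
    Odd roots of negative numbers keep the negative sign.
    """
    if x < 0:
        if n % 2 == 0:
            raise ValueError("an even root of a negative number is not real")
        return -((-x) ** (1 / n))
    return x ** (1 / n)


print(math.sqrt(9.0))      # only the positive square root is returned
print(real_root(-8, 3))    # an odd root of a negative number is negative
print(cmath.sqrt(-4))      # the complex module answers with a multiple of i
```

Python writes the imaginary unit as `j` rather than i, which is a notational choice borrowed from engineering.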
1.5 Logarithms

To solve any equation, we have to be able to isolate the variable x on one side of the equation. When we encounter an expression like $x^a$, where a is a constant, we can take the ath root to cancel out the exponent and leave us with x by itself. Such expressions are called power functions. However, when we have an expression like $a^x$, we cannot take any root to cancel out the base and leave us with the exponent. Such expressions are called exponential functions. So the question considered here is how to isolate a variable x if x is in an exponent. We have to use a mathematical operation called a logarithm.

Definition: Logarithm. $\log_a(y)$ (read as “the log-base-a of y”) is equal to x, where x is the solution to the following equation:
\[ a^x = y. \]

An exponential function has three parts: a constant base a, a variable exponent x, and y, the value that the exponential function is equal to. Likewise, a logarithm has a constant base a. If you give a logarithm the number y, it will provide the value of x that will make $a^x = y$.

Example 1. Evaluate $\log_{10}(1{,}000)$. $\log_{10}(1{,}000)$ is equal to x, where x is the solution to $10^x = 1{,}000$. Our task is to find x. It is clear that $x = 3$ since $10^3 = 1{,}000$. Therefore, $\log_{10}(1{,}000) = 3$.

Logarithms cancel out bases because of the following property, derived from the definition of a logarithm. We start with the exponential equation:
\[ a^x = y. \]
The definition of a logarithm tells us that $\log_a(y) = x$. Then we see cancellation by taking the log-base-a of both sides of the exponential equation:
\[ a^x = y, \]
\[ \log_a(a^x) = \log_a(y), \]
\[ \log_a(a^x) = x. \]
So taking the logarithm of base a of an exponential function of base a leaves you with the variable exponent itself.

There are many different kinds of logarithms; they vary based on their different bases. The two most common types of logarithms are the common logarithm, which has a base of 10, and the natural logarithm, which has a base of e.
The number e. e is a fundamental mathematical constant. It is extremely important in many subfields and applications of mathematics. In inferential statistics, for example, it serves as the constant base of the normal distribution. Like π, e is irrational and has infinitely many digits. e is named after the famous mathematician Leonhard Euler (pronounced “Oiler”) and is equal to the following infinite sum:
\[ 1 + 1 + \left(\frac{1}{1 \times 2}\right) + \left(\frac{1}{1 \times 2 \times 3}\right) + \left(\frac{1}{1 \times 2 \times 3 \times 4}\right) + \left(\frac{1}{1 \times 2 \times 3 \times 4 \times 5}\right) + \cdots \approx 2.71828. \]
Most statistical and mathematical software packages, including Stata and R, use the natural logarithm ($\log_e$) for logarithmic calculations by default. e is discussed in much greater detail in Section 4.4.
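The infinite sum for e converges quickly, which can be seen by adding up only its first few terms. The sketch below is illustrative: the function name `e_partial_sum` is an assumption for this example, and `math.e` is Python's built-in value of the constant, used here only for comparison.

```python
import math


def e_partial_sum(terms):
    """Add the first `terms` terms of 1 + 1 + 1/(1*2) + 1/(1*2*3) + ..."""
    total = 0.0
    factorial = 1
    for k in range(terms):
        if k > 0:
            factorial *= k        # build 1*2*...*k one factor at a time
        total += 1 / factorial
    return total


for terms in (3, 6, 15):
    print(terms, e_partial_sum(terms), abs(e_partial_sum(terms) - math.e))
```

Python behaves like Stata and R in the respect described above: `math.log` with a single argument is the natural logarithm.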
There are some notational rules to remember about logarithms. The common logarithm of y, $\log_{10}(y)$, is usually denoted just $\log(y)$; the base is left off of the notation. The natural logarithm of y, $\log_e(y)$, is usually denoted as $\ln(y)$.5 Most scientific calculators can evaluate common and natural logarithms but not other logarithms. To evaluate other logarithms with a calculator, there is a formula to change the base of a logarithm:
\[ \log_b(y) = \frac{\log_a(y)}{\log_a(b)}. \]
A scientific calculator has no function to evaluate log , for example. But the above formula can be used to express the function in terms of common or natural logarithms.

Example 2. Consider the expression log ( ). Rewrite this expression in terms of common and natural logarithms.
• To convert the logarithm to an expression using only common logarithms, log (
)=
log ( ) log( ) = . log ( ) log( )
• To convert the logarithm to an expression using only natural logarithms, log (
)=
loge ( ) ln( ) = . loge ( ) ln( )
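The change-of-base formula is also how a logarithm with an arbitrary base can be computed when only common and natural logarithms are available. The sketch below is a minimal illustration; the function names and the values 8 and 2 are hypothetical and are not the numbers used in the example above.

```python
import math


def log_base_via_ln(y, b):
    """log_b(y) written with natural logarithms: ln(y) / ln(b)."""
    return math.log(y) / math.log(b)


def log_base_via_log10(y, b):
    """log_b(y) written with common logarithms: log(y) / log(b)."""
    return math.log10(y) / math.log10(b)


# Both routes agree (up to floating-point rounding) on the log-base-2 of 8.
print(log_base_via_ln(8, 2))
print(log_base_via_log10(8, 2))
```

Either base works as the intermediate step because the formula holds for any base a; the choice only depends on which logarithm the calculator or software offers.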
All logarithms regardless of their base share the same properties and rules. These rules are as follows:
5 Most sources in mathematics use these notations for the common and natural logarithm. However, some work in statistics and econometrics, which use the natural logarithm almost exclusively, refers to the natural logarithm as log(y) instead of ln(y). This can be confusing. As a general rule, it is safest to assume that log(y) is the common logarithm in mathematics, physics, engineering, and other hard mathematical sciences unless stated otherwise and to assume that log(y) refers to the natural logarithm in economics, statistics, and other applied and social sciences unless otherwise stated.
1. The logarithm of a product. A logarithm turns multiplication inside the logarithm to addition outside the logarithm:
\[ \log_a(xy) = \log_a(x) + \log_a(y). \]

2. The logarithm of a quotient. A logarithm turns division inside the logarithm to subtraction outside the logarithm:
\[ \log_a\left(\frac{x}{y}\right) = \log_a(x) - \log_a(y). \]

3. The logarithm of an exponential or power function. Logarithms turn exponents inside the logarithm to factors outside the logarithm:
\[ \log_a(x^c) = c \log_a(x). \]

4. Logarithms will cancel out the base of an exponential function if it shares the same base:
\[ \log_a(a^x) = x. \]

5. Conversely, exponential bases cancel out logarithms if they have the same base:
\[ a^{\log_a(x)} = x. \]
• For the common logarithm, these rules can be written as $\log(10^x) = x$ and $10^{\log(x)} = x$.
• For the natural logarithm, these rules are as follows: $\ln(e^x) = x$ and $e^{\ln(x)} = x$.

6. Any logarithm of 1 is equal to 0, no matter what base is being used:
\[ \log_a(1) = 0. \]

7. The logarithm of the base is equal to 1:
\[ \log_a(a) = 1. \]
So for the common and natural logarithms, we have, respectively, $\log(10) = 1$ and $\ln(e) = 1$.

8. The logarithm of a negative number does not exist. Sometimes, people say that the logarithm of 0 is equal to −∞, but for our purposes, only logarithms of positive numbers are defined.

9. The logarithm of a reciprocal. In addition to breaking it up into subtraction, the logarithm of a fraction is equal to −1 times the logarithm of the reciprocal. For any quantities $a, b, c > 0$,
\[ \log_a\left(\frac{a}{b}\right) = -\log_a\left(\frac{b}{a}\right) \quad \text{and} \quad \log_a\left(\frac{1}{c}\right) = -\log_a(c). \]
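Because these rules hold for every base, they can be spot-checked numerically with any positive inputs. The following is a sketch rather than a proof; the function name `log_rules_hold`, the tolerance, and the sample values are assumptions chosen for illustration.

```python
import math


def log_rules_hold(x, y, c, base=math.e, tol=1e-9):
    """Check the product, quotient, power, cancellation, and reciprocal rules.

    x and y must be positive; base must be positive and not equal to 1.
    Comparisons use a tolerance because floating-point logs are rounded.
    """
    def log(v):
        return math.log(v, base)

    checks = [
        (log(x * y), log(x) + log(y)),    # rule 1: product
        (log(x / y), log(x) - log(y)),    # rule 2: quotient
        (log(x ** c), c * log(x)),        # rule 3: power
        (log(base ** c), c),              # rule 4: log cancels the base
        (base ** log(x), x),              # rule 5: base cancels the log
        (log(1), 0.0),                    # rule 6: log of 1
        (log(base), 1.0),                 # rule 7: log of the base
        (log(1 / x), -log(x)),            # rule 9: reciprocal
    ]
    return all(math.isclose(left, right, rel_tol=tol, abs_tol=tol)
               for left, right in checks)


print(log_rules_hold(2.5, 4.0, 3.0))            # natural logarithm
print(log_rules_hold(2.5, 4.0, 3.0, base=10))   # common logarithm
```

Rule 8 shows up in software as an error: `math.log` raises a `ValueError` for zero or negative inputs.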
Example 3. log( ) = − log( ).
Example 4. Use the properties of logarithms to simplify the following expression: ( ) x e y+ ln . z First, use the quotient rule to turn the fraction into subtraction: ( ) ( ) x e y+ z . ln = ln(x e y+ ) − ln z Then the factors can be broken up into addends: ( ) = ln(x ) + ln(e
y+
) − ln
− ln(z).
Finally, we can use the power rule, the reciprocal rule, and the cancellation of the natural logarithm with e to clean up the expression even more: = ln(x) + y + + ln( ) − ln(z).
Logarithms find extensive use in the social sciences and statistical research for two reasons. First, logarithms turn products into sums, and it is often much easier to deal with terms that are added together rather than multiplied together. Second, logarithms break large numbers down into small numbers (e.g., log( ) = log( ) = ) without changing the ordering of quantities. In microeconomics, researchers usually perform analyses of personal income by taking the logarithm of income.
1.6 Summations and Products

Sometimes you will work with equations in which quite a lot of numbers are added together or multiplied together. In some cases, an infinite number of terms are added or multiplied. Writing out these long sums and long products would be very cumbersome, so there are two symbols that simplify these equations. The symbol $\sum$ refers to a summation, and the symbol $\prod$ refers to a long product. These symbols are nothing to be intimidated by. In fact they make long equations much easier to read and write.

Summations and products have an expression inside them that is repeatedly added to or multiplied by itself. Both $\sum$ and $\prod$ use an index variable, often denoted i, which is a part of this expression. The index variable is a counter. Summations and products start with a particular value of i, increment values of i by 1, and continue until some terminating value is reached. The starting value of the index is listed below the $\sum$ and $\prod$ symbols, and the ending value is listed above them.
For example, the expression ∏
i
i=
is equal to × × × × =
,
and the expression ∑
i
i=
equals +
+
+
+
= + + + +
=
.
Summations and products can be used to represent an infinite number of addends or factors. Suppose Amtrak designed a new kind of train to offer rides from Charlottesville to Washington. This new train will take you half the remaining distance to your destination every minute. So after 1 minute, you are already halfway there. In the second minute you travel another 1/4th of the distance, and in the third minute you travel another 1/8th of the distance. Suppose Charlottesville and Washington are 100 miles apart. We can represent the distance you travel after 3 minutes as
\[ \sum_{i=1}^{3} \frac{100}{2^i} = \frac{100}{2} + \frac{100}{4} + \frac{100}{8} = 50 + 25 + 12.5 = 87.5 \text{ miles}, \]
and after 6 minutes as
\[ \sum_{i=1}^{6} \frac{100}{2^i} = 50 + 25 + 12.5 + 6.25 + 3.125 + 1.5625 = 98.4375 \text{ miles}. \]
We can write the distance after an infinite number of minutes as
\[ \sum_{i=1}^{\infty} \frac{100}{2^i}, \]
whose partial sums get closer and closer to 100 miles without ever equaling it.

We can also represent general sums. Suppose you’ve taken a survey of N respondents. The respondents are given ID numbers from 1 to N, and each respondent reports his or her annual income. The income of respondent 1 is denoted as $x_1$, the income of respondent 2 is $x_2$, and so on. The mean income of all the respondents in your survey is the sum of all the incomes divided by the total number of respondents. In summation notation, this average is
\[ \frac{\sum_{i=1}^{N} x_i}{N}. \]

Finally, summation and product notation can be used to scale down very, very large expressions. Consider a polynomial (discussed in greater detail in Section 2.5). A polynomial is the sum of various powers of a variable, each multiplied by its own constant. In the expression below, the a terms are all constants:
\[ a_n x^n + a_{n-1} x^{n-1} + \cdots + a_3 x^3 + a_2 x^2 + a_1 x + a_0. \]
As shown in Table 1.1, I can rewrite this ugly expression in summation notation:
\[ \sum_{i=0}^{n} a_i x^i, \]
which is a much cleaner way to write the expression.

Table 1.1 Breaking Down Summation Notation

i   |  $a_i x^i$   |  $\sum a_i x^i$ (sum of the terms from 0 through i)
0   |  $a_0$       |  $a_0$
1   |  $a_1 x$     |  $a_1 x + a_0$
2   |  $a_2 x^2$   |  $a_2 x^2 + a_1 x + a_0$
3   |  $a_3 x^3$   |  $a_3 x^3 + a_2 x^2 + a_1 x + a_0$
... |  ...         |  ...
n   |  $a_n x^n$   |  $a_n x^n + a_{n-1} x^{n-1} + \cdots + a_3 x^3 + a_2 x^2 + a_1 x + a_0$
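A summation is a loop that adds, so each of the three uses above (the train's distance, the mean income, and the polynomial) can be written with a running sum. The sketch below assumes the setup described in the text, a 100-mile trip that covers half the remaining distance each minute; the function names and the sample incomes are hypothetical.

```python
def train_distance(minutes, total_miles=100):
    """Sum total_miles / 2**i for i = 1, 2, ..., minutes."""
    return sum(total_miles / 2**i for i in range(1, minutes + 1))


def mean(values):
    """Sum of x_i for i = 1..N, divided by N."""
    return sum(values) / len(values)


def polynomial(coeffs, x):
    """Sum of a_i * x**i for i = 0..n, where coeffs[i] holds a_i."""
    return sum(a * x**i for i, a in enumerate(coeffs))


print(train_distance(3))             # 87.5
print(train_distance(6))             # 98.4375
print(train_distance(50) < 100)      # still short of 100 miles
print(mean([32000, 45000, 58000]))   # hypothetical incomes
print(polynomial([1, 2, 3], 2))      # 1 + 2*2 + 3*2**2 = 17
```

Note that `range(1, minutes + 1)` plays the role of the index limits: the first argument is the value written below the summation sign, and the last value produced is the one written above it.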
It is important to remember that a summation is just addition, nothing more. Likewise, a long product is only multiplication. All of the rules for simplifying expressions with addition and multiplication also apply to summations and to long products. The following properties of summations are easily derived from the properties of the operation of addition:

1. Commutativity means that the order in which terms are added does not change the value of the sum. For two real numbers a and b, $a + b = b + a$. For summations, an implication of commutativity is that summations can be broken up across addition (or subtraction). Consider the summation
\[ \sum_{i=1}^{N} (a_i + b_i). \]
The first term in the summation is $(a_1 + b_1)$, the second term is $(a_2 + b_2)$, and in general the jth term is $(a_j + b_j)$. The sum of these terms from 1 to N is
\[ a_1 + b_1 + a_2 + b_2 + \cdots + a_j + b_j + \cdots + a_N + b_N. \]
Since we can rearrange the order of these terms without changing the sum, we can rewrite the sum as
\[ a_1 + a_2 + \cdots + a_j + \cdots + a_N + b_1 + b_2 + \cdots + b_j + \cdots + b_N = (a_1 + a_2 + \cdots + a_j + \cdots + a_N) + (b_1 + b_2 + \cdots + b_j + \cdots + b_N). \]
This sum can be written more neatly using the following summation notation:
\[ \sum_{i=1}^{N} a_i + \sum_{i=1}^{N} b_i. \]
So summations have the following property:

Commutative property of summations: Summations can be broken up over addition,
\[ \sum_{i=1}^{N} (a_i + b_i) = \sum_{i=1}^{N} a_i + \sum_{i=1}^{N} b_i, \]
and subtraction,
\[ \sum_{i=1}^{N} (a_i - b_i) = \sum_{i=1}^{N} a_i - \sum_{i=1}^{N} b_i. \]

2. Multiplication is defined as repeated addition. If we add a real number a to itself N times, the sum is
\[ \underbrace{a + a + a + \cdots + a}_{N \text{ times}} = Na. \]
Likewise, if the term inside a summation does not include the index, then it does not change from one iteration to the next within the summation and the same thing gets added over and over again. The definition of multiplication implies the following property of summations:

Multiplication property of summations: If the term inside the summation does not include the index i, then
\[ \sum_{i=1}^{N} a = Na. \]
3. Factoring: If every term in the sum is a product, and if all of these products share the same factor, then we can bring this factor outside the sum as follows:
\[ ab_1 + ab_2 + \cdots + ab_N = a(b_1 + b_2 + \cdots + b_N). \]
This property applies to summations as well:

Factoring property of summations: If the term inside a summation is a product in which one of the factors does not depend on the index i, then that factor can be brought outside the summation:
\[ \sum_{i=1}^{N} ab_i = a \sum_{i=1}^{N} b_i. \]
4. Nondistribution of exponents: This is not so much a property as a warning not to make a common mistake. If a sum is taken to an exponent, the exponent cannot be distributed to the individual addends. It is important to remember that
\[ (a + b)^2 \neq a^2 + b^2. \]
When there are two added terms inside the same expression to be squared, the formula is
\[ (a + b)^2 = a^2 + 2ab + b^2, \]
as will be discussed soon in Section 1.7.2. If three or more terms are in the expression to be squared, the distribution of the exponent is much more complicated. For a summation, be very cautious with exponents, and pay particular attention to whether the exponent is inside or outside the summation:

Nondistribution property of summation exponents: It is possible for the individual terms inside a summation to each be taken to an exponent, such as
\[ \sum_{i=1}^{N} (a_i^2). \]
It is also possible for the entire summation itself to be taken to an exponent, for example,
\[ \left(\sum_{i=1}^{N} a_i\right)^2. \]
These two expressions are not in general equal:
\[ \sum_{i=1}^{N} (a_i^2) \neq \left(\sum_{i=1}^{N} a_i\right)^2. \]

It’s easy to write a summation with exponents in a confusing way. For example, if we write
\[ \sum_{i=1}^{N} a_i^2, \]
it might be unclear whether we are squaring each individual $a_i$ term or the whole summation. One should also use caution with addends inside a summation.
For example, in the expression
\[ \sum_{i=1}^{N} a_i + b, \]
the b term does not depend on the index, but it is unclear whether b is inside or outside the summation. If the explicit meaning of the summation cannot be inferred from the context, then it is always safer to use parentheses.

These properties can be used together to rewrite many expressions with summations.

Example 1. Rewrite the following expression:
\[ \sum_{i=1}^{N} (x_i + 1)^2. \]
Note that the square refers to each individual $(x_i + 1)$ term and not to the whole summation. First, we can multiply the square out, which gives us
\[ \sum_{i=1}^{N} (x_i^2 + 2x_i + 1). \]
The commutative property of addition allows us to break this summation up across addition as follows:
\[ \sum_{i=1}^{N} x_i^2 + \sum_{i=1}^{N} 2x_i + \sum_{i=1}^{N} 1. \]
The second summation includes a constant factor 2, which the factoring property allows us to bring outside the summation:
\[ \sum_{i=1}^{N} x_i^2 + 2\sum_{i=1}^{N} x_i + \sum_{i=1}^{N} 1. \]
Finally, the third summation contains a constant that does not depend on the index. The multiplication property allows us to rewrite this summation as
\[ \sum_{i=1}^{N} x_i^2 + 2\left(\sum_{i=1}^{N} x_i\right) + N. \]
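The rewriting in Example 1 can be confirmed on a small data set. The sketch below assumes the squared term is $(x_i + 1)$, which expands to $x_i^2 + 2x_i + 1$; the function names and the data values are hypothetical. It also shows the warning from property 4: the sum of squares is not the square of the sum.

```python
def direct_sum(xs):
    """Square each (x_i + 1) term, then add the results."""
    return sum((x + 1) ** 2 for x in xs)


def rewritten_sum(xs):
    """Use the rewritten form: sum of x_i^2, plus 2 * sum of x_i, plus N."""
    n = len(xs)
    return sum(x**2 for x in xs) + 2 * sum(xs) + n


xs = [3, -1, 4, 0, 5]   # hypothetical data

print(direct_sum(xs), rewritten_sum(xs))      # both forms agree
print(sum(x**2 for x in xs), sum(xs) ** 2)    # exponent inside vs. outside
```

In statistical work the rewritten form is often the more useful one, because the sum of the values and the sum of their squares can each be computed once and reused.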
The properties of multiplication can also be applied to long products:

1. Commutativity means that the order in which factors are multiplied does not change the value of the product. For two real numbers a and b, $ab = ba$. Commutativity implies that long products can be broken up across multiplication. Consider the long product
\[ \prod_{i=1}^{N} a_i b_i. \]
This expression is equal to the product of each individual $a_i b_i$ term:
\[ a_1 b_1 a_2 b_2 a_3 b_3 \cdots a_N b_N. \]
Since we can rearrange the factors, this product can be rewritten as
\[ a_1 a_2 a_3 \cdots a_N b_1 b_2 b_3 \cdots b_N = (a_1 a_2 a_3 \cdots a_N)(b_1 b_2 b_3 \cdots b_N) = \prod_{i=1}^{N} a_i \prod_{i=1}^{N} b_i. \]

Commutative property of long products: Long products can be broken up over multiplication,
\[ \prod_{i=1}^{N} a_i b_i = \prod_{i=1}^{N} a_i \prod_{i=1}^{N} b_i, \]
and division,
\[ \prod_{i=1}^{N} \frac{a_i}{b_i} = \frac{\prod_{i=1}^{N} a_i}{\prod_{i=1}^{N} b_i}. \]
Note that, in general, long products do not break up over addition,
\[ \prod_{i=1}^{N} (a_i + b_i) \neq \prod_{i=1}^{N} a_i + \prod_{i=1}^{N} b_i, \]
and summations do not break up over multiplication,
\[ \sum_{i=1}^{N} a_i b_i \neq \sum_{i=1}^{N} a_i \sum_{i=1}^{N} b_i. \]

2. Exponentiation is defined as repeated multiplication. If we multiply a number by itself N times, we are taking that number to the power of N:
\[ \underbrace{a \times a \times a \times \cdots \times a}_{N \text{ times}} = a^N. \]
If the term inside a long product does not include the index, then it does not change from iteration to iteration of the long product. In that case, we are multiplying the same number by itself over and over. The definition of exponentiation implies the following property of long products:

Exponentiation property of long products: If the term inside the long product does not include the index i, then
\[ \prod_{i=1}^{N} a = a^N. \]
3. Factoring: If every factor in a product is itself a product and if all of these products share the same factor, then these common factors can be combined and exponentiated:
\[ (ab_1) \times (ab_2) \times (ab_3) \times \cdots \times (ab_N) = a^N b_1 b_2 b_3 \cdots b_N. \]
This property applies to long products as well:

Factoring property of long products: A constant factor can be brought outside a long product as long as it is exponentiated:
\[ \prod_{i=1}^{N} ab_i = a^N \prod_{i=1}^{N} b_i. \]

4. Distribution of exponents: Unlike summations, exponents outside a product can be distributed to each term inside a product. In addition, if each term in a product has an exponent, and these exponents are equal, then that exponent can be brought outside to be applied to the entire product. For two real numbers a and b, this property is
\[ (ab)^m = a^m b^m. \]
For a long product, this property is as follows:

Distribution property of exponents in long products: An exponent outside a long product can be placed on every individual term inside the long product, and common exponents inside a long product can be brought outside to apply to the entire long product. For any real number exponent m,
\[ \left(\prod_{i=1}^{N} a_i\right)^m = \prod_{i=1}^{N} a_i^m. \]

5. Sum of exponents: Recall from Section 1.3 that if two numbers with the same base and different exponents are multiplied together, then the product is the base taken to the power of the sum of the exponents of the two factors: $x^a x^b = x^{a+b}$. The property applies to long products as well:

Summed exponent property of long products: Consider a long product in which the term inside the product is exponential, where the base does not depend on the index but the exponent does depend on the index:
\[ \prod_{i=1}^{N} a^{b_i}. \]
This long product is equal to the base of one term taken to the summation of the exponents:
\[ \prod_{i=1}^{N} a^{b_i} = a^{\sum_{i=1}^{N} b_i}. \]
All of these properties can be used together to rewrite expressions with long products.

Example 2. Rewrite the following expression:
\[ \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x_i^2}, \]
where e is a constant.6 Before we apply the rules of long products, let’s simplify the expression inside the long product. Recall from Sections 1.3 and 1.4 that setting a number as the denominator of 1 is the same as taking the number to the power of $-1$ and taking the square root is the same as taking the number to the power of $\frac{1}{2}$. Taken together, these two properties imply that the number $2\pi$ is taken to the $-\frac{1}{2}$ power:
\[ \prod_{i=1}^{N} (2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2} x_i^2}. \]
The first factor does not depend on the index, so it can be brought outside the long product and exponentiated:
\[ \prod_{i=1}^{N} (2\pi)^{-\frac{1}{2}} e^{-\frac{1}{2} x_i^2} = \left((2\pi)^{-\frac{1}{2}}\right)^N \prod_{i=1}^{N} e^{-\frac{1}{2} x_i^2} = (2\pi)^{-\frac{N}{2}} \prod_{i=1}^{N} e^{-\frac{1}{2} x_i^2}. \]
The remaining terms inside the long product have a constant base of e, so the exponents can be placed inside a summation:
\[ (2\pi)^{-\frac{N}{2}} e^{\sum_{i=1}^{N} -\frac{1}{2} x_i^2}. \]
Finally, the constant factor of $-\frac{1}{2}$ can be brought outside the summation:
\[ (2\pi)^{-\frac{N}{2}} e^{-\frac{1}{2} \sum_{i=1}^{N} x_i^2}. \]
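The long-product properties used in this section can be checked numerically in the same way as the summation properties. The sketch below relies on `math.prod` from Python's standard library (available in Python 3.8 and later); the function name `product_properties_hold` and the sample values are assumptions for illustration, and positive values are used so that every power is defined.

```python
import math


def product_properties_hold(a, bs, m=3):
    """Check the factoring, exponent-distribution, and summed-exponent properties.

    a is a positive constant, bs is a list of positive numbers b_1..b_N,
    and m is a common exponent.
    """
    n = len(bs)
    # Factoring: the product of a*b_i equals a**N times the product of b_i.
    factoring = math.isclose(math.prod(a * b for b in bs),
                             a**n * math.prod(bs))
    # Distribution: (product of b_i)**m equals the product of b_i**m.
    distribution = math.isclose(math.prod(bs) ** m,
                                math.prod(b**m for b in bs))
    # Summed exponents: the product of a**b_i equals a**(sum of b_i).
    summed = math.isclose(math.prod(a**b for b in bs), a ** sum(bs))
    return factoring and distribution and summed


print(product_properties_hold(2.0, [1.5, 0.5, 3.0, 2.0]))
```

The summed-exponent property is the reason products of exponential terms, such as the one in Example 2, are usually rewritten with a single summation in the exponent.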
6 In fact, it is Euler’s constant, which is as important as π. See Section 4.4.

1.7 Solving Equations and Inequalities

1.7.1 Isolating a Variable

An equation is an equation exactly because the left- and right-hand sides of the equation are equal. Think of a scale that is perfectly balanced; what is on the right side of the scale weighs the same as what is on the left side of the scale. We can place more on each side of the scale or take away from each side of the scale and preserve the balance as long as we do the same thing to both sides of the scale. Likewise, we preserve the equality in an equation as long as we do the same thing to both sides of the equation. Our task is to find the value of x such that the two sides of the equation are equal. Here is our technique for solving equations: Isolate x by adding, subtracting, multiplying, or dividing both sides of the equation by the same thing. We can also take a root of both sides, set both sides to the same power, set both sides as exponents of the same base, or take the logarithm of both sides. Consider the following equation:
\[ 4x + 10 = 30. \]
Our goal is to find a number x such that if we multiply it by 4 and add 10, the result equals 30. Instead of guessing and checking our answers, we calculate the answer precisely by adding, subtracting, multiplying, and dividing both sides of the equation by the same number to get x alone on one side of the equation. First, we can subtract 10 from both sides:
\[ 4x + 10 - 10 = 30 - 10, \]
\[ 4x = 20. \]
Next we divide both sides by 4:
\[ \frac{4x}{4} = \frac{20}{4}, \]
\[ x = 5. \]

Example 1. Solve
√
z− = .
The square root can be canceled by squaring both sides of the equation. Note that we will have to disregard any answer that sets z < , as this would place a negative inside the square root. √ ( z− ) = , z− = , z=
.
An important technique in solving equations is the method of combining like terms. Like terms are products that contain the same variable, and we add or subtract these like terms just as we can add and subtract constants. For example, in the equation $5x + 7x =$ , $5x$ and $7x$ are like terms because they are products that contain the same variable, x. We can add them together: $12x =$ , $x =$ . It might help to think of these terms as counts of a particular kind of object. Think of $5x$ as “5 xylophones” and $7x$ as “7 xylophones.” Five xylophones plus 7 xylophones is 12 xylophones. In these algebraic equations, it is important to bring all the terms that contain x to the same side of the equation and to combine them.

Example 2. Solve x + = x − . First, we subtract 5 from both sides: x= x− .
Now, to put all the terms that contain x on the same side of the equation, we subtract x from both sides: x− x=− . Note that x and x are like terms, so we can subtract them x=− , then dividing by 4 gives us the answer x=− .
1.7.2 Distribution and Factoring One important rule to remember about numbers is the distribution property of real numbers. This rule states that for any real numbers a, b, and c, a(b + c) = ab + ac. That is, a number in front of a sum or difference in parentheses must be multiplied by every term inside the parentheses. Example 1. Solve
− y = . y− First things first: Note that y = cannot be accepted as an answer to this problem because that would leave 0 in the denominator. If y = is the only solution, then there is no answer here. First, multiply both sides of the equation by (y − ): − y = (y − ). We use the distribution property to distribute the 2 on the right-hand side of the equation: − y= y−
.
Then we add and subtract terms to get all of the y terms on one side and all the constants on the other: +
= y + y, = y, y=
=
.
Sometimes distribution is not so easy. Consider the following expression: ( x − )(x + ). We have to perform three distributions here. First, we distribute the ( x − ) to the terms x and : x( x − ) + ( x − ). Now we distribute the x and the : x − x+ x−
= x + x−
.
There is a shortcut for this kind of repeated distribution, as indicated by the acronym FOIL.
• F: multiply the first terms,
• O: multiply the outside terms,
• I: multiply the inside terms,
• L: multiply the last terms,

and add them all together.
For the above example, the first terms multiply to x , the outside terms to x, the inside terms to − x, and the last terms to − . Adding together, we once again get x + x− x−
= x + x−
.
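FOIL can be written as a small function that works on the coefficients of two binomials. The sketch below is an illustration only: the function name `foil` and the binomials $(2x - 3)(x + 4)$ are hypothetical and are not the numbers from the example above.

```python
def foil(a, b, c, d):
    """Expand (a*x + b)(c*x + d).

    Returns the coefficients (x^2 term, x term, constant term).
    """
    first = a * c      # F: first terms, a*x times c*x
    outside = a * d    # O: outside terms, a*x times d
    inside = b * c     # I: inside terms, b times c*x
    last = b * d       # L: last terms, b times d
    # The outside and inside products are like terms, so they combine.
    return first, outside + inside, last


print(foil(2, -3, 1, 4))   # (2, 5, -12), that is, 2x^2 + 5x - 12
```

The middle coefficient is the sum of the outside and inside products, which is the fact the factoring steps later in this section run in reverse.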
The expression x + x − is called a quadratic expression. Any expression that is the sum of a constant, a multiple of a variable x, and a multiple of the square of that variable is a quadratic expression. We represent a generic quadratic expression as $ax^2 + bx + c$, where a, b, and c are constants.

The distribution property shows us that expressions such as $5(x + y)$ equal $5x + 5y$, but we observe that the distributive property also works the other way. If we see something like $5x + 5y$, we need to be able to recognize that we can pull that 5 out to the front if we want to. Pulling out common factors, or using the distributive rule in reverse, is called factoring. Being able to factor expressions is a very important technique for solving equations.

Example 2. Solve x −
x= .
First, note that the terms in this equation share the common factors of 5 and x, which we can factor out: x(x − ) = . The left-hand side of this equation is equal to 0 only when x = or x = . So the solutions to this equation are x = and x = .
Like everything else in mathematics, FOIL has a technique to reverse it. To factor any quadratic expression, $ax^2 + bx + c$, I find it useful to follow these steps:

1. Multiply a and c, and list all the factor pairs of this product.

2. Check to see if there exists a factor pair that adds to b. If there is no pair that adds to b, then the expression cannot be neatly factored. If there is, rewrite the expression by breaking b up into these addends.

3. Bring the common factors out of the first two terms, and the common factors out of the last two terms. What remains from the first two terms should be the same as what remains from the last two, and it can be factored out from the whole expression.
Example 3. Factor x + x −
.
• Step 1. First observe that a =
and c = − . The product of a and c is − , which means that one factor must be positive and one must be negative. The factor-pairs are (− and ), (− and 12), (− and 8), (− and 6), (1 and − ), (2 and − ), (3 and − ), and (4 and − ). • Step 2. We find the factor pair that adds to b. In this case, b = , and the factor pair (-3 and ) adds to . The expression can be rewritten as x − x + x − . • Step 3. We can rewrite our expression as ( x − x) + ( x −
),
which equals x( x − ) + ( x − ). The x − can be factored out, leaving us with ( x − )(x + ).
Example 4. Can the expression x +
x − be neatly factored?
• Step 1. a = and c = − . The product of a and c is −6, and the factor pairs are (1 and −6), (2 and −3), (−1 and 6), and (−2 and 3).
• Step 2. None of these factor pairs add to b =
; therefore this expression cannot be neatly
factored.
When $a = 1$ in quadratic expressions, factoring becomes easier. To factor an expression of the form $x^2 + bx + c$, we only have to find two numbers that add to b and multiply to c. Suppose $d_1$ and $d_2$ add to b and multiply to c. Then $x^2 + bx + c = (x + d_1)(x + d_2)$. When the two numbers that add to b and multiply to c are the exact same number, we call the quadratic expression a perfect square. For example, consider $x^2 + 6x + 9$. The two numbers that add to 6 and multiply to 9 are 3 and 3. Therefore, $x^2 + 6x + 9 = (x + 3)^2$.
Be careful. It is natural to want to distribute the square. But $(x + 3)^2 \neq x^2 + 9$. Do not distribute exponents to added terms. Instead, use FOIL. There are some other special cases to know involving factoring. First, consider the following FOIL problem: $(x + a)(x - a) = x^2 - ax + ax - a^2 = x^2 - a^2$. The fact that the two middle terms cancel out gives us the difference-of-squares formula listed below. There are formulas for cubes, derived similarly, which are also useful. For any number a, Difference-of-squares formula: $x^2 - a^2 = (x + a)(x - a)$.
Difference-of-cubes formula: $x^3 - a^3 = (x - a)(x^2 + ax + a^2)$.
Sum-of-cubes formula: $x^3 + a^3 = (x + a)(x^2 - ax + a^2)$.
It is easy to confuse the difference-of-cubes formula and the sum-of-cubes formula. Notice that these two formulas differ only in the three signs between the terms. To remember the formulas, it is helpful to use the acronym SOAP, which stands for “same, opposite, always positive” (see Table 1.2).
Table 1.2 SOAP: How to Remember the Difference of Cubes and Sum of Cubes Formulas
Formula               S (Same)                     O, AP (Opposite, Always Positive)
Difference of cubes   $(x^3 - a^3) = (x - a)$      $(x^2 + ax + a^2)$
Sum of cubes          $(x^3 + a^3) = (x + a)$      $(x^2 - ax + a^2)$
The “S” in SOAP refers to the first sign, between x and a in the first term. This sign should be the same as the sign in the term being factored: subtraction for the difference of cubes and addition for the sum of cubes. The “O” refers to the sign between $x^2$ and $ax$ in the second term. This is the opposite of the sign in the term being factored: addition for the difference of cubes and subtraction for the sum of cubes. Finally, the sign between $ax$ and $a^2$ in the second term is “AP,” or always positive, meaning it is always an addition sign.
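A quick numerical check can make the SOAP sign pattern concrete. The Python sketch below is only an illustration: the function names are hypothetical, and the check runs over a small grid of integers chosen here, which is evidence for the identities rather than a proof.

```python
def difference_of_cubes(x, a):
    # Same sign (-) in the first factor, opposite (+) then always positive (+).
    return (x - a) * (x**2 + a * x + a**2)


def sum_of_cubes(x, a):
    # Same sign (+) in the first factor, opposite (-) then always positive (+).
    return (x + a) * (x**2 - a * x + a**2)


# Compare each factored form with the unfactored expression on a grid of integers.
for x in range(-5, 6):
    for a in range(-5, 6):
        assert difference_of_cubes(x, a) == x**3 - a**3
        assert sum_of_cubes(x, a) == x**3 + a**3
```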
Example 5. Simplify
x − x −
.
Note that = and = . The numerator is a difference of squares and the denominator is a difference of cubes. The expression can be rewritten as (x + )(x − ) (x − )(x + x + which reduces to
)
,
x+ . x + x+ Note that if we had been solving an equation we would have to discount an answer of x = , since that would make the denominator in the original expression equal to 0.
1.7.3 Solving Quadratic Equations
An equation of the form $ax^2 + bx + c = 0$ is called a quadratic equation. If it is possible to factor $ax^2 + bx + c$, then the solutions for x are easy to find once the expression is factored. Just find the value of x that makes each individual factor equal to 0. Example 1. Solve x − x −
= .
We can factor the left-hand side. Note that two numbers that add to − and multiply to − are 2 and − . Therefore, the equation becomes (x + )(x − ) = . The left-hand side is 0 exactly when x = − or x = . Therefore the solutions are x = − and x= . Example 2. Solve ln(x + x + ) = . Note that x + x + = (x + ) , which is always greater than or equal to 0. The only value of x that we cannot allow as an answer is x = − . We can rewrite the equation as ( ) ln (x + ) = . We can use the properties of logarithms to simplify the equation: ln(x + ) = , ln(x + ) = . Then, setting both sides as a power of the number e will cancel out the natural logarithm: eln(x+ ) = e , x+ =e , x=e − .
Sometimes it is not possible to neatly factor a quadratic equation. In this case, there is still a formula we can use to solve for x. Let a, b, and c be constants. To solve equations of the form $ax^2 + bx + c = 0$, use the quadratic formula: $$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}.$$ The ± sign refers to two possible solutions for x, one in which this sign is addition and one in which this sign is subtraction. These two solutions are distinct from one another when $b^2 - 4ac$ (called the discriminant) is positive. When the discriminant is 0 the two solutions are the same, so we say that only one solution exists. Finally, when the discriminant is negative, there are no real solutions because we cannot take the square root of a negative number.
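The three discriminant cases translate directly into code. The following Python sketch is an illustration under stated assumptions: solve_quadratic is a hypothetical helper name, the inputs are real numbers with a nonzero, and only real solutions are returned.

```python
import math


def solve_quadratic(a, b, c):
    """Return the real solutions of a*x^2 + b*x + c = 0 as a sorted list."""
    if a == 0:
        raise ValueError("a must be nonzero for a quadratic equation")
    discriminant = b**2 - 4 * a * c
    if discriminant < 0:
        return []                      # no real solutions
    if discriminant == 0:
        return [-b / (2 * a)]          # the two solutions coincide
    root = math.sqrt(discriminant)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


print(solve_quadratic(1, -3, 2))  # [1.0, 2.0]
print(solve_quadratic(1, 2, 1))   # [-1.0]
print(solve_quadratic(1, 0, 1))   # []
```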
Example 3. Solve x−
x +
= .
Remember from Example 4 in Section 1.7.2 that this quadratic expression cannot be neatly factored. Therefore, there’s no shortcut to finding the solutions for x, and we have to resort to using the quadratic formula. Here a = , b = , and c = − . Plugging these numbers into the quadratic formula, we get √ −b ± b − ac x= ( a) √ − ± ( ) − ( )(− ) = ( × ) √ − ± + = √ − ± . = Recall from Section 1.4 that we can simplify a square root by writing the number inside the root in terms of its prime factorization and removing duplicated factors. In this case, √ √ √ = × × × × = , so the solutions reduce to (−
±
√
( x=
− +
√
√ ± √ =− ± , √ ) ( and x = − − )
=−
) .
1.7.4 Solving Inequalities
An inequality is any expression that involves one of the inequality signs:
• >, which stands for “greater than,”
• ≥, which stands for “greater than or equal to,”
• <, which stands for “less than,” and
• ≤, which stands for “less than or equal to.”
.
We can solve this inequality in the same way we solve the equation − x + subtract 7 from both sides: − x> .
=
. First we
Then we divide both sides by − . Because we are dividing by a negative number, we have to flip the inequality sign. The solution is x , where $d_1 < d_2$, we know that the left-hand side of the inequality is equal to 0 at $x = d_1$ and $x = d_2$. We don’t yet know, however, exactly where this expression is greater than 0. These zero points give us three regions to consider:
1. x values less than $d_1$
2. x values between $d_1$ and $d_2$
3. x values greater than $d_2$
The easiest technique to solve the quadratic inequality is to plug one number from each region into the quadratic expression to see where it is positive and negative. Example 2. Solve x −
x+
< .
We solve this inequality the same way we would solve the quadratic equation x − x + = . First we try factoring. In this quadratic expression, a = and c = . The product is 42, and the factor pairs are (1 and 42), (2 and 21), (3 and 14), (6 and 7), (−1 and −42), (−2 and −21), (−3 and −14), and (−6 and −7). This expression factors neatly if one of the factor pairs adds to b = − , and that’s true for (− and − ). So we rewrite the inequality as x − x−
x+
( x − x) − ( x −
< , )< .
x factors out of the first term, and 7 factors out of the second term, so we write x( x − ) − ( x − ) < ,
and then we factor ( x − ) from both terms: ( x − )(x − ) < . We can see that the quadratic expression is 0 when x = and when x = , but we need to check where the expression is negative. First, we check the x values less than by testing x = , for example: ( )( ) ( )− − =− ×− = , so the quadratic expression appears to be positive for values of x less than . Next we test x values between and 7 by considering x = as an example: ( )( ) ( )− − = ×− =− , so the expression is negative in this region. Finally, we test x > by looking at x = : ( )( ) ( )− − = × = , so the expression is positive here. Therefore, the solution consists of values of x between
and 7.
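The region-testing technique can be sketched in a few lines of Python. This is an illustration only: negative_regions is a hypothetical helper, and the quadratic (x − 2)(x − 7) is an example chosen here, not the expression from Example 2. The sketch assumes the two zero points are already known, with d1 < d2.

```python
def negative_regions(f, d1, d2):
    """Test one point from each of the three regions and keep those where f < 0."""
    test_points = {
        f"x < {d1}": d1 - 1,               # any point to the left of d1
        f"{d1} < x < {d2}": (d1 + d2) / 2,  # the midpoint between the zeros
        f"x > {d2}": d2 + 1,               # any point to the right of d2
    }
    return [region for region, x in test_points.items() if f(x) < 0]


def example(x):
    # Hypothetical factored quadratic with zeros at x = 2 and x = 7.
    return (x - 2) * (x - 7)


print(negative_regions(example, 2, 7))  # ['2 < x < 7']
```

One test point per region is enough because a quadratic can only change sign at its zero points.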
One important difference between single-variable equations and single-variable inequalities is that equations have a limited number of solutions while inequalities can have an infinite number of solutions. As in Example 2, we must find the range of x values that satisfy the inequality, and there may be infinitely many values in this range. The set of solutions that satisfy the inequality in Example 2 is graphed in Figure 1.2. A Graph of x −
x+
}.
To translate this statement, consider each piece of the notation. A is the name of the set. The brackets enclose the elements of a set. ∈ means “exists in.” R is the set of real numbers. | means “such that.” So the whole notation is read “A is the set of all real numbers x such that x is greater than 100” or simply “A is the set of all real numbers greater than 100.” If a researcher states that a property holds for numbers in set A, then the property holds “for reals greater than 100.” Figure 2.1
Saturday Morning Breakfast Cereal by Zach Weiner, Comic #2613
Source: http://www.smbc-comics.com/?id=2613
Set builder notation is also useful for expressing finite sets or even a single number if you want to illustrate why the elements are important. Recall from Example 2 in Section 1.7.3 that the solution to ln(x + x + ) = is x = e − . We can express this number as { } x ∈ R ln(x + x + ) = , which is read as “the set of real numbers x such that ln(x + x + ) = .” Finally, sometimes sets are most easily described in terms of how certain numbers relate to other numbers. To express a set in this way, we need to use the ∀ symbol, which is read as “for all,” and the ∃ symbol, which is read as “there exists” or “for some.” For example, an infinitesimal number is defined as a number that is infinitely small; it is greater than 0 but
CHAPTER 2. SETS AND FUNCTIONS
less than any positive real number. There is no way to express an infinitesimal number with numerals, but we can express it with the following set builder notation: $\{x \in \mathbb{R} \mid 0 < x < m,\ \forall m \in \mathbb{R}^+\}$, which translates to “the set of real numbers x such that x is greater than 0 but less than m for all numbers m in the set of positive real numbers.” Put more succinctly, this statement means “a real number that is greater than 0 and less than every other positive real number.” Recall from Section 1.1 that an even number is an integer that is twice another integer. To show that an integer z is even, we only need one example of another integer that equals z when multiplied by 2. For example, to prove that 10 is even, we only have to show that $2 \times 5$ is 10. In set builder notation, the set of even numbers is $\{z \in \mathbb{Z} \mid z = 2n,\ \exists n \in \mathbb{Z}\}$, which translates to “the set of integers z such that z equals two times n for some number n in the set of integers,” or just “integers that are equal to two times another integer.”
Example 2. Express the following sets:
(1) Q = the set of rational numbers. In Section 1.1, the rational numbers are defined as the set of real numbers that can be expressed as a fraction of integers. We only need to find one example of two integers that form this fraction. The set builder notation that corresponds to this definition is $\mathbb{Q} = \{x \in \mathbb{R} \mid x = \frac{a}{b},\ \exists a, b \in \mathbb{Z}\}$, which translates to “the set of real numbers that equal the fraction $\frac{a}{b}$, for some integer a and some integer b.”
(2) ∞. ∞ is a number that is larger than all real numbers. We can define ∞ as a set this way: $\infty = \{x \mid x > m,\ \forall m \in \mathbb{R}\}$, which reads “a number that is greater than all real numbers.” Note that we do not write that $x \in \mathbb{R}$ in the set builder notation. ∞ is not considered to be a real number.
(3) Let S be the set of bills voted on in the U.S. Senate this year. Let $s \in S$ be one bill, and let $s^*$ represent the number of “yea” votes on the bill. Sixty yea votes are needed to break a filibuster. Express the set $S^*$ of bills that have a filibuster-proof majority. $S^* = \{s \in S \mid s^* \geq 60\}$.
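Set builder notation has a close analogue in Python's set comprehensions, which may help in reading the notation. The sketch below is illustrative only: a computer cannot enumerate all integers, so the even numbers are built inside a finite window chosen here, and the bill names and vote counts are made-up placeholders rather than real data.

```python
# {z in Z | z = 2n, for some n in Z}, restricted to a finite window of integers.
window = range(-10, 11)
evens = {z for z in window if any(z == 2 * n for n in window)}

# S* = {s in S | s* >= 60}, using hypothetical bills and yea-vote counts.
yea_votes = {"Bill A": 62, "Bill B": 51, "Bill C": 60}
filibuster_proof = {s for s, votes in yea_votes.items() if votes >= 60}

print(sorted(evens))
print(sorted(filibuster_proof))
```

The `any(...)` call plays the role of ∃ ("there exists"); Python's `all(...)` would play the role of ∀ ("for all").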
2.2 Intervals A particular kind of infinite set is an interval. Intervals include all real numbers between given end points. The interval may or may not include these end points. If an end point is included in the set, it is enclosed within a bracket. If the end point is not included in the set,
it is enclosed within a parenthesis. Remember that the real numbers include all the fractional numbers in between the integers, so the interval from 1 to 2 includes 1.1, 1.5, 1.99999, and so on. There are several kinds of intervals, listed in Table 2.2. Table 2.2
Types of Intervals
Interval Notation   Set Builder Notation   Translation
[a, b]              {x ∈ R | a ≤ x ≤ b}    All real numbers between a and b, including a and b
(a, b)              {x ∈ R | a < x < b}    All real numbers between a and b, not including a or b
[a, b)              {x ∈ R | a ≤ x < b}    All real numbers between a and b, including a but not b
(a, b]              {x ∈ R | a < x ≤ b}    All real numbers between a and b, including b but not a
(a, ∞)              {x ∈ R | x > a}        All real numbers greater than a, not including a
[a, ∞)              {x ∈ R | x ≥ a}        All real numbers greater than a, including a
(−∞, b)             {x ∈ R | x < b}        All real numbers less than b, not including b
(−∞, b]             {x ∈ R | x ≤ b}        All real numbers less than b, including b
An interval [a,b] that includes the endpoints is called a closed set, and an interval (a,b) that does not include the end points is called an open set. If an interval such as [a,b] has two end points, a is also referred to as the lower bound and b is also referred to as the upper bound. Some intervals may only have one bound. ∞ is an infinitely large and positive number and is used to denote that an interval is unbounded above. That is, the interval [a, ∞) includes all real numbers greater than or equal to a. Conversely, −∞ is used to denote an infinitely negative number. The interval (−∞, b] contains all real numbers that are less than or equal to b. Intervals that include ∞ or −∞ place these values in parentheses, never brackets, since ∞ is not technically a real number.1 The interval (−∞, ∞) is the same as the entire set of real numbers. Set notation can be used for intervals as well. The expression [a,b) ∪ (c,d] includes all real numbers that are in either [a,b) or (c,d], or both. The expression [a,b) ∩ (c,d] includes only the real numbers that are in both [a,b) and (c,d].
1 In other words, a real number x can be less than ∞ but never equal to ∞; therefore, x < ∞ makes sense but x ≤ ∞ does not.
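Interval membership, unions, and intersections can be sketched with a small Python class. Everything here is an assumption made for illustration: the Interval class, its field names, and the sample end points are not from the text, and unbounded intervals are modeled with math.inf on an open side.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lower: float
    upper: float
    lower_closed: bool = False  # True means a bracket, False a parenthesis
    upper_closed: bool = False

    def __contains__(self, x):
        above = x >= self.lower if self.lower_closed else x > self.lower
        below = x <= self.upper if self.upper_closed else x < self.upper
        return above and below


def in_union(x, first, second):
    return x in first or x in second


def in_intersection(x, first, second):
    return x in first and x in second


left = Interval(1, 5, lower_closed=True)        # [1, 5)
right = Interval(3, 8, upper_closed=True)       # (3, 8]
ray = Interval(2, math.inf, lower_closed=True)  # [2, ∞)

print(in_union(1, left, right))         # True: 1 is in [1, 5)
print(in_intersection(3, left, right))  # False: 3 is not in (3, 8]
print(in_intersection(4, left, right))  # True
print(2 in ray, 1e9 in ray)             # True True
```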
Example 1. List the intervals that correspond to the following statements: (1) All real numbers between 3 and 7, including 3 and 7 [ , ]: Brackets are used since both 3 and 7 are included in the set. (2) All real numbers strictly less than − . (−∞, − ): Since no lower bound is specified, all infinitely negative numbers are included in the set. − is not included, so a parenthesis is used. (3) All real numbers x such that x −
x+
The domain is restricted to $x \in (0, \infty)$ because the function specifies that $x > 0$. If, however, no restrictions on x were specified, then the value of $x = 0$ would still be excluded so as to prevent division by 0. To determine the possible bounds of the range, consider the values of x in the domain that approach 0 and approach ∞:
x          $f(x) = \left(1 + \frac{1}{x}\right)^x$
1          2
.1         ≈ 1.2710
.01        ≈ 1.0472
.001       ≈ 1.0069
.0001      ≈ 1.0009

x          $f(x) = \left(1 + \frac{1}{x}\right)^x$
10         ≈ 2.5937
100        ≈ 2.7048
1,000      ≈ 2.7169
10,000     ≈ 2.7181
100,000    ≈ 2.7183
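The two columns of function values can be reproduced with a few lines of Python. This is a minimal sketch, assuming the function under discussion is f(x) = (1 + 1/x)^x for x > 0; the variable names are chosen here for illustration.

```python
import math


def f(x):
    # Defined only for x > 0; x = 0 would divide by zero.
    return (1 + 1 / x) ** x


near_zero = [1, 0.1, 0.01, 0.001, 0.0001]
toward_infinity = [10, 100, 1_000, 10_000, 100_000]

for x in near_zero + toward_infinity:
    print(f"{x:>10}  {f(x):.4f}")

print(f"e = {math.e:.4f}")
```

Evaluating at a handful of points suggests the bounds of the range but does not prove them, which is why a graph or a formal argument is still needed.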
It appears that the values of f(x) approach 1 as the values of x approach 0, and approach a number close to 2.718 as the values of x approach ∞. In fact, as x gets larger, the value of this function approaches e! e pops up in unexpected places all over mathematics and is a favorite number of number theorists, numerologists, and mystics alike. A candidate range for f(x) in this case is therefore $(1, e)$. But we must make sure that values inside the domain of x do not exceed these bounds. The best way to test whether these bounds truly describe the range of f(x) is to graph the function. Figure 2.12 shows this graph. All of the values considered fall within $(1, e)$.
Figure 2.12 The Graph of $f(x) = \left(1 + \frac{1}{x}\right)^x$ for $x > 0$ (x-axis from 0 to 100; labeled value e = 2.7183 . . .)
2.5 Polynomials
A polynomial is a particular kind of function that has very important applications to statistics and to social science research. These functions are sums of the powers of x multiplied by constants, where the powers are whole numbers.2 We call the constants coefficients. The degree of a polynomial is the value of the largest exponent. For example, the function f(x) = x −1 x +
x + x + x−
2 Remember from Section 1.1 that the whole numbers are the nonnegative integers and include 0. Polynomials often include an addend that is just a constant. One can think of the constant as being multiplied by $x^0$, which equals 1.
is a polynomial since the function is the sum of the powers of x times constants, where the powers are whole numbers. The numbers 7, − , 12, 8, 5, and − are coefficients. This polynomial is a fifth-degree polynomial since the largest power is 5. By far the most important kind of polynomial in statistics is the first-degree polynomial, which is called a linear function. Higher order polynomials (a phrase that is synonymous with “higher degree polynomials”) are also very useful. Both kinds of polynomials are discussed below.
2.5.1 Linear Functions and Linear Graphs
A linear function is a first-degree polynomial that takes the form $f(x) = mx + b$, where m and b are constants and $m \neq 0$. The graph of a linear function is a straight line, where m here is the slope and b is the y-intercept. These terms deserve to be defined explicitly.
Definition: Slope
The amount that a graph rises vertically for every unit the graph moves to the right. Curved functions have slopes that change depending on the position on the x-axis being considered, but linear functions have the same slope at every point in the graph. The slope of a linear function is traditionally denoted m but is usually denoted β in statistical models. If two points $(x_1, y_1)$ and $(x_2, y_2)$ are on the same line, then the slope of the linear function can be computed using the following formula: $$m = \frac{y_2 - y_1}{x_2 - x_1}$$ or, equivalently, $$m = \frac{y_1 - y_2}{x_1 - x_2}.$$
Another way to think about the formula for slope is that the difference $y_2 - y_1$ is the “rise” of the line and the difference $x_2 - x_1$ is the “run” of the line. The slope is, therefore, the “rise over run” of the line. It does not matter whether the coordinates of the first point are subtracted from the coordinates of the second point, or vice versa, so long as the same difference is calculated for both the numerator and the denominator of the slope formula. A positive slope means that the graph gets higher as it moves toward the right, and it approaches the upper-right and lower-left corners of the graph. A negative slope means that the graph gets lower as it moves toward the right, and it approaches the lower-right and upper-left corners of the graph. These slopes relate directly to the positive and negative relationships that we observe between variables in the social sciences. A positive relationship means that two variables go up or go down together. For example, the probability that a person develops heart disease has a positive relationship with whether and how much the person smokes: the more the person smokes, the more likely it is that the person contracts heart disease; the less the person smokes, the less likely it is that the person contracts heart disease. A negative relationship means that as one variable increases, the other decreases. If the supply of a commodity goes up, the cost goes down, and if the supply goes down, the cost goes up. The other component of a linear function is the y-intercept.
Definition: y-Intercept
The point on the y-axis occupied by a line when $x = 0$. Traditionally, the y-intercept is denoted b. In statistics, however, the y-intercept is usually denoted either α or $\beta_0$ and is called either the intercept or a constant. If the slope is known and at least one point in the graph is known, the y-intercept can be derived by plugging m, x, and y into $y = mx + b$ and solving for b.
With the slope and the y-intercept, one can graph the line, as in Figure 2.13.
Figure 2.13 Identifying the Slope and y-Intercept of $f(x) = 2x - 3$ (marked points: (0, −3), (4, 5), and (6, 9); run: 6 − 4 = 2; rise: 9 − 5 = 4; slope: rise / run = 4/2 = 2; y-intercept = −3)
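The slope formula and the substitution step for the y-intercept can be combined into one small routine. The Python sketch below is illustrative: line_through is a hypothetical helper name, and the two points are the ones marked in Figure 2.13.

```python
def line_through(point_1, point_2):
    """Return (m, b) for the line y = m*x + b through two points."""
    (x1, y1), (x2, y2) = point_1, point_2
    if x1 == x2:
        raise ValueError("a vertical line has no defined slope")
    m = (y2 - y1) / (x2 - x1)  # rise over run
    b = y1 - m * x1            # solve y1 = m*x1 + b for b
    return m, b


slope, intercept = line_through((4, 5), (6, 9))
print(slope, intercept)  # 2.0 -3.0
```

Swapping the order of the two points changes the sign of both the rise and the run, so the slope is unchanged.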
Example 1. Find the linear function that passes through the points ( , ) and (− , ). Since we have two points, we can first determine the slope: f(x) m= f(y)
− = − (− )
=
f (x)
. f (y)
We know that a linear function takes the form y = mx + b, and we know that the line passes through ( , ), so we can solve for b by plugging in m, x, and y: y = mx + b, × + b,
=
1
1
= b=
+ b, −
=
.
So the linear function that passes through ( , ) and (− , ) is f(x) =
x+
.
2.5.2 Higher-Order Polynomials
In the following equations, let $a_0, a_1, \ldots, a_n$ be constants. Second-degree polynomials are of the form $f(x) = a_2x^2 + a_1x + a_0$ and are also called quadratic functions. Their graphs are shaped like parabolas, as in the left-hand panel of Figure 2.14. When the coefficient $a_2$ is positive, the parabola opens upward on the graph, like a smile, and when the coefficient $a_2$ is negative the parabola opens downward, like a frown. The point at which a positive quadratic function is at its lowest or a negative quadratic function is at its highest is called the vertex of the parabola. For any parabola $f(x) = a_2x^2 + a_1x + a_0$, the vertex is the point with the following x value: $$x = -\frac{a_1}{2a_2}.$$
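The vertex formula is easy to check numerically. The Python sketch below is an illustration with a hypothetical helper name and an example parabola chosen here, $x^2 - 4x + 3$.

```python
def vertex(a2, a1, a0):
    """Return the (x, y) coordinates of the vertex of a2*x^2 + a1*x + a0."""
    if a2 == 0:
        raise ValueError("a2 must be nonzero for a quadratic function")
    x = -a1 / (2 * a2)
    y = a2 * x**2 + a1 * x + a0
    return x, y


print(vertex(1, -4, 3))  # (2.0, -1.0): the lowest point of an upward-opening parabola
```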
Figure 2.14 Graphs of an Example Quadratic, Cubic, and Quartic Polynomial (three panels: Quadratic, Cubic, Quartic)
The vertex of a parabola is an example of a critical point, which is the point at which a graph changes from moving upward to downward, or vice versa. The points at which the parabola crosses the x-axis (so that $f(x) = 0$) are called the roots of the function.3 As discussed in Section 1.7.3, quadratic functions can have up to two roots but may have only one root or no roots at all. The roots can sometimes be found using factoring or else with the quadratic formula. A third-degree polynomial is called a cubic function, and a fourth-degree polynomial is called a quartic function. A cubic graph and a quartic graph are illustrated in the center and right-hand panels of Figure 2.14. Cubic graphs can contain up to two critical points and up to three roots. Quartic functions contain up to three critical points and four roots. See a pattern? In general, we can define an nth-degree polynomial to be of the form $$f(x) = a_nx^n + a_{n-1}x^{n-1} + \cdots + a_2x^2 + a_1x + a_0.$$ We can rewrite this polynomial in summation notation, as discussed in Table 1.1 in Section 1.6: $$f(x) = \sum_{i=0}^{n} a_i x^i.$$
This nth-degree polynomial can have up to $n - 1$ critical points and n roots.4
Example 1. Find a 10th-degree polynomial with 10 roots. This problem may seem overwhelming at first. Instead of guessing and checking every polynomial with $x^{10}$, we can just construct such a polynomial ourselves. One 10th-degree polynomial with 10 roots is $f(x) = (x - 1)(x - 2)(x - 3)(x - 4)(x - 5)(x - 6)(x - 7)(x - 8)(x - 9)(x - 10)$, which has roots at x = 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. This polynomial is of degree 10 since multiplying all the x terms together yields a term with $x^{10}$. The question does not require us to actually multiply this polynomial out, so let’s not! We can also easily construct a 10th-degree polynomial with nine roots by repeating one of the roots, as in f(x) = (x − ) (x − )(x − )(x − )(x − )(x − )(x − )(x − )(x − ). Likewise, a 10th-degree polynomial with only one root is f(x) = (x − ) . The following 10th degree polynomial, f(x) = x + , has no real-value roots since no value of x can make f(x) = .
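Although the example rightly avoids multiplying the factors out by hand, a short program can do it. The Python sketch below is illustrative only: the helper names are hypothetical, coefficients are stored as a list $[a_0, a_1, \ldots, a_n]$ to match the summation form, and the roots 1 through 10 are used as sample input.

```python
def poly_from_roots(roots):
    """Coefficients [a_0, ..., a_n] of the product of (x - r) over the given roots."""
    coefficients = [1]  # the constant polynomial 1
    for r in roots:
        shifted = [0] + coefficients                    # x * p(x)
        scaled = [-r * c for c in coefficients] + [0]   # -r * p(x)
        coefficients = [s + t for s, t in zip(shifted, scaled)]
    return coefficients


def evaluate(coefficients, x):
    """Evaluate the sum of a_i * x^i for i = 0, ..., n."""
    return sum(a * x**i for i, a in enumerate(coefficients))


p = poly_from_roots(range(1, 11))
print(len(p) - 1)                                    # 10, the degree
print(all(evaluate(p, r) == 0 for r in range(1, 11)))  # True
```

Integer arithmetic keeps the check exact; with floating-point roots the evaluations would only be approximately zero.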
3 This term is an example of imprecise mathematical language since the word roots is also used for square roots, cube roots, and so on. To clarify, some mathematicians prefer the term zeroes to denote the points where the function crosses the x-axis.
4 Solving for these roots directly, however, is a big challenge in advanced computational mathematics and complex analysis for large polynomials. The research is fascinating but is well beyond the scope of this book.
2.5.3 Linear Regression
One of the most important methods in quantitative social science is linear regression. For most students in the social sciences, this topic will be covered in great detail in another methodology course. Regression is a method that carries its own terminology and logic, and for many students the speed with which the technical language is presented is overwhelming. But there’s very little more to a linear regression than the mathematics of linear functions presented in Section 2.5.1. In this section, I present the basics of linear regression as a small step beyond graphing y = mx + b. Linear regression is a method to analyze data. Data can be defined very broadly, but for our purposes we can say that data are values arranged in a table, called a data set.5 The rows of this table are called observations or cases, and the columns of this table are called variables. These basic elements of a data set are illustrated in Figure 2.15.
Figure 2.15 The Basic Elements of a Data Set (columns are variables; rows are observations, or cases; cells are data points; a period marks a missing data point)
OBS   X1   X2   X3
1     3    2    6
2     .    5    1
3     1    1    .
4     0    0    4
5     7    9    0
6     8    3    2
7     6    7    4
8     5    1    2
Each observation should represent one unit of analysis, which is one object from the set of objects being considered. If a public opinion scholar conducts a survey of individuals in the United States, the unit of analysis is individuals from the United States, and each row in the data represents one individual. If a scholar collects aggregated data on European countries, then each row represents one European country. Each variable represents a distinct piece of information about the observation. So the public opinion researcher can record the age, sex, party affiliation, and political viewpoints of each subject, and these pieces of information are distinct variables. Likewise, a European politics scholar can collect the population, area, government type, and social policies of each European country.
5 The word data can be used as if it is singular or as if it is plural. That is, saying “the data is” and “the data are” are both appropriate. In official academic writing, however, more scholars tend to treat the word data as plural, so I use “the data are” here.
The values inside each cell in the table are called data points and refer to the value of one particular observation on one particular variable. Data points may be qualitative, logical, or numeric. Qualitative refers to textual information: a subject’s full name, his or her transcribed response to an open-ended survey question, the name of a country. Logical values are numbers or labels that express a category: male or female; a vote for Obama, for Romney, or for a third-party candidate, or no vote at all; an autocracy, a developing democracy, or an established democracy. Numeric values are numeric quantities: a subject’s age, a runner’s finishing time in a marathon, a country’s gross domestic product. Linear regressions work on numeric variables, and extensions of the regression framework (which you will learn about in other courses) can be used for logical variables. Very often, some data points are missing.6 The reasons why a data point may be missing vary from situation to situation. In a survey, income or voting information might be missing because the respondent refused to answer those questions. Countries may not report all the relevant economic information. Some values may be missing because the question doesn’t apply: There’s no sense in asking a single person how long he or she has been married or in counting the roll call votes in a country with no legislature. If the observations are a randomly drawn sample from a larger population, then the data can be used to make inferences about the population. A randomly selected sample of 1,000 American adults is usually sufficient to draw accurate inferences about the entire adult population of the United States. An example of data is presented in Table 2.4. Table 2.4
Example Data
Observation   Left-Right Ideology   Obama Approval   Observation   Left-Right Ideology   Obama Approval
1             –2.50                 9.04             9             0.36                  2.94
2             –2.14                 1.97             10            0.71                  –4.06
3             –1.79                 3.21             11            1.07                  1.99
4             –1.43                 8.07             12            1.43                  –5.18
5             –1.07                 4.38             13            1.79                  –4.22
6             –0.71                 0.74             14            2.14                  –5.92
7             –0.36                 –1.71            15            2.50                  –10.89
8             0.00                  –2.20
6 Different notations are used to denote a missing data point. Excel simply leaves a cell blank, Stata places a period inside a missing cell, and R denotes missing values as NA.
This example is artificial, but it illustrates the concept of a linear regression very well. This is a survey of 15 individuals, denoted observations 1 through 15 in Table 2.4. Each observation should have its own row, but for space the table is broken into two halves and aligned side by side. Two variables are measured for each individual. First, the individual’s ideological position on a left–right dimension is measured. Negative values indicate liberal positions, positive values indicate conservative positions, and values farther from 0 indicate more extreme positions. The second variable is a similar index for how much the respondents approve of the way Barack Obama is handling his job as president. Positive values indicate approval, negative values indicate disapproval, and 0 is a neutral assessment. One of the most useful ways to visualize data is with a scatterplot. A scatterplot is a graph that contains the points that represent two variables. The x-coordinate represents the value of one variable (usually the independent variable, or the variable that is the “cause”), and the y-coordinate represents the value of the other variable (the dependent variable, or the “effect”). The data in Table 2.4 are graphed in a scatterplot in the left-hand panel of Figure 2.16.
Figure 2.16 An Example of a Scatterplot and a Linear Regression (left panel: Scatterplot; right panel: Best Fit Line; x-axis: Ideology of the Respondent; y-axis: Approval of Barack Obama)
Observation 1 in Table 2.4, for example, has an ideology score of −2.50 and an Obama approval score of 9.04. That is, this person is very liberal and is also very supportive of Obama. We plot the point (−2.50, 9.04) in the scatterplot along with the other 14 points. The goal of a linear regression is to create a model of the data that can predict values of y given a value of x and to assess whether or not x has a real effect on y. We make the assumption that the two variables relate to each other in a linear way. That is, y is a function of x that takes the form y = mx + b. Having assumed that this model is linear, the task of a linear regression is to find the values of the slope and the y-intercept that provide the best possible fit for the data set as a
whole. There is only one value of the slope and only one value of the y-intercept that together produce this best-fit line. In this example, the best-fit line is overlaid on the scatterplot in the right-hand panel of Figure 2.16. The slope and the y-intercept of a linear regression are called the model parameters. There is a specific formula that can be used to estimate the values of the parameters that yield the best-fit line, but this formula involves linear algebra and is not discussed until Section 9.3. For now, we can assess whether one proposed slope and y-intercept are a better fit for the data than another proposed slope and y-intercept. First, let’s rewrite the linear model y = mx + b using the notation that is commonly used in statistics for linear regressions: $y_i = \alpha + \beta x_i$. The slope is now denoted as β, and the y-intercept (more commonly referred to as the constant or just as the intercept) is now denoted α. The subscript i refers to the observation number, which makes it explicit that $y_1$ depends on $x_1$, that $y_2$ depends on $x_2$, and so on. The subscript is only used in order to be precise. This equation, however, is incomplete. Notice that the best-fit line in Figure 2.16 does not hit any of the points on the scatterplot. Indeed, no line can hit any more than two of the points on this graph, and any nonlinear function that hits every point must be a squiggly, useless mess. The line does not predict y exactly given x; it predicts y with error. That is, the line misses each point by some distance either above or below. We must include these errors in our model. The regression model now becomes $y_i = \alpha + \beta x_i + \epsilon_i$, or more specifically, $(\text{Obama approval})_i = \alpha + \beta(\text{Left–Right Ideology})_i + \epsilon_i$, where $\epsilon_i$ is the vertical distance between the point for observation i and the line. These errors are called residuals. For example, the point for Observation 1, (−2.50, 9.04), is above the best-fit line. When x = −2.50, the line hits the point (−2.50, 7.10), so the residual for the first observation is $\epsilon_1 = 9.04 - 7.10 = 1.94$. The second observation, (−2.14, 1.97), is below the best-fit line, which hits the point (−2.14, 6.08), so the residual is negative: $\epsilon_2 = 1.97 - 6.08 = -4.11$. In plain English, these residuals mean that the first individual approves of Obama 1.94 units more than the model predicts, and the second individual approves of Obama 4.11 units less than the model predicts. All of the residuals for this example are graphed in Figure 2.17. It so happens that the best-fit line in Figures 2.16 and 2.17 must have a slope of − . and an intercept of − . . The linear regression model is therefore (Obama approval)i = − .
− . (Left–Right Ideology)i + ϵi .
This model provides a predicted level of Obama approval for any given value of left–right ideology. When making predictions, we cannot ignore the residuals. However, we assume that for any individual, ϵ will on average be equal to 0, is just as likely to be positive as negative, and is more likely to be closer to 0 than farther away from 0.7 We observe no individual with an ideology score of − . But we can predict that such an individual would 7 Specifically, we assume that ϵ follows a normal distribution with a mean of 0 and a variance that is estimated from the data.
CHAPTER 2. SETS AND FUNCTIONS
Figure 2.17 Residuals From the Linear Regression
[Scatterplot with the best-fit line and the residuals ε1 through ε15 marked. Vertical axis: Approval of Barack Obama. Horizontal axis: Ideology of the Respondent.]
approve of Obama as follows:

(Obama approval) = − . − 2.89 (− ) + ϵ,
(Obama approval) = − . + ϵ.

That is, we predict that a randomly drawn individual with an ideology score of − will, on average, approve of Obama with a score of − . . The residual tells us that an individual is equally likely to approve of Obama more than − . or less than − . but is more likely to have an approval score close to − . than very far away from it. It is possible (in fact, essential) to estimate from the data just how much above or below the predicted mean a random individual may be, but this calculation is beyond the scope of this section.

The estimates for the two parameters have meaningful substantive interpretations. The slope and the intercept from any regression in any situation have an interpretation that elaborates on these general sentences:

Interpretation of a β parameter:
If the model includes only one x variable: A one-unit increase in x is associated, on average, with a β change in y.
If the model includes several x variables: A one-unit increase in x is associated, on average, with a β change in y, after considering all the other x variables in the model.

Interpretation of an α parameter:
If the model includes only one x variable: When x is equal to 0, y is predicted on average to be α.
If the model includes several x variables: When all of the x variables are simultaneously equal to 0, y is predicted on average to be α.
As indicated by these statements, a regression model may contain multiple x variables. I discuss that briefly below. But for this example, which has only one x variable, these general sentences can be used to make specific statements about our regression model. For the slope:

A 1-unit increase in conservatism is associated, on average, with a 2.89-unit decrease in the amount an individual approves of the way Barack Obama is handling his job as president.

And for the intercept, remember that an ideology of 0 refers to an individual who is exactly moderate:

An individual who is exactly moderate approves of the way Barack Obama is handling his job as president with an average score of − . units.

As with the predicted values, these parameters are estimated with some amount of uncertainty. It is important to state the amount of uncertainty inherent in these estimates. It is often especially important to state whether you can confidently conclude that the slopes are different from 0, since a slope of 0 means that a change in x is associated with only random changes in y. You will learn the conventions for reporting these measures of uncertainty in different courses.

So how do we know that a particular line is the best fit? There is a specific definition for the best-fit line.

Definition: Best-fit line
A linear model for a data set is the best-fit line if and only if the slopes and intercept are chosen such that the sum

$\sum_{i=1}^{N} \epsilon_i^2$

is smaller than the same sum calculated for a line with any other set of parameter estimates.

This sum, called the sum of squared residuals (SSR) or the sum of squared errors (SSE), is minimized for the best-fit line. Squaring the residuals ensures that positive and negative values do not cancel each other out, and it also penalizes large residuals more than small ones. If we change any slope or the intercept by any amount, the value of the SSR goes up. We can use this definition to prove that one line is a better fit for the data than another line.

For the example of ideology and presidential approval, in which the best-fit line has a slope of −2.89 and an intercept of − . , we list the residuals that are graphed in Figure 2.17. This list is given in Table 2.5. We can calculate the SSR by squaring each of the 15 residuals and adding the squares together:

$\mathrm{SSR} = \sum_{i=1}^{15} \epsilon_i^2 = \epsilon_1^2 + \epsilon_2^2 + \cdots + \epsilon_{15}^2.$
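The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the five data points are invented (they are not the book's data), the function names `ssr` and `least_squares_line` are hypothetical, and the closed-form slope and intercept used here are the standard least-squares formulas for a single x variable, which the text defers to later sections.

```python
def ssr(xs, ys, intercept, slope):
    """Sum of squared residuals for the line y = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Standard closed-form intercept and slope that minimize the SSR (one x)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data, invented for illustration only.
ideology = [-2.0, -1.0, 0.0, 1.0, 2.0]
approval = [7.0, 2.0, 1.0, -4.0, -5.0]

alpha, beta = least_squares_line(ideology, approval)
best = ssr(ideology, approval, alpha, beta)

# A competing line: nudge both parameters away from the least-squares values.
other = ssr(ideology, approval, alpha + 0.5, beta - 0.5)

print(alpha, beta)
print(best, other)
```

Under these assumptions, the competing line produces a larger SSR than the least-squares line, which is the sense in which one line is shown to fit better than another.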
Now consider a second linear model that has a slope of − and an intercept of − . The line and the residuals from this line are illustrated in Figure 2.18 and listed in Table 2.6. Clearly this second model is a better fit for some observations, especially Observations 1, 10, and 15. But note that some of the residuals are much higher than they are under the best-fit model, especially Observations 13 and 14. If we calculate the new SSR, we find that the bad
Table 2.5 Example Data, Fitted Values, and Residuals

Observation Number | Left–Right Ideology | Obama Approval | Linear Prediction | Residual (ϵ)
1 | − . | . | . | .
2 | − . | . | . | − .
3 | −. | . | . | −.
4 | −. | . | . | .
5 | −. | . | . | .
6 | − . | . | . | −.
7 | − . | −. | . | − .
8 | . | − . | − . | − .
9 | . | . | −. | .
10 | . | − . | − . | −.
11 | . | . | − . | .
12 | . | − . | − . | − .
13 | . | − . | − . | .
14 | . | − . | − . | .
15 | . | − . | − . | − .
Figure 2.18 Residuals From a Linear Model That Fits Less Well
[Scatterplot with the second line and the residuals ε1 through ε15 marked. Vertical axis: Approval of Barack Obama. Horizontal axis: Ideology of the Respondent.]
Table 2.6 Example Data, Fitted Values, and Residuals From a Model That Fits Less Well

Observation Number | Left–Right Ideology | Obama Approval | Linear Prediction | Residual (ϵ)
1 | − . | . | . | .
2 | − . | . | . | − .
3 | −. | . | . | − .
4 | −. | . | . | .
5 | −. | . | . | .
6 | − . | . | . | − .
7 | − . | −. | . | −.
8 | . | − . | −. | −.
9 | . | . | − . | .
10 | . | − . | − . | .
11 | . | . | − . | .
12 | . | − . | − . | .
13 | . | − . | − . | .
14 | . | − . | − . | .
15 | . | − . | − . | .
outweighs the good: the sum of the 15 squared residuals in Table 2.6, $\sum_{i=1}^{15} \epsilon_i^2$, is larger than the SSR of the best-fit line.
Therefore, we have proven that the line with a slope of −2.89 and an intercept of − . is a better fit for the data than the model with a slope of − and an intercept of − . I find that I personally cannot easily tell just by looking at the two residual plots which line is a better fit for the data. Most people will not be able to identify the best-fit line if they have to draw it or identify it from similar competitors. Minimizing the SSR, however, provides a systematic and unambiguous way to assess the fit of each line.

Regressions can also be modeled using higher-order polynomials. Including a squared, cubic, or higher power term in the polynomial introduces a curvilinear effect, where the slope of the line changes with the value of x. Suppose that we model the income of individuals as a function of age. People make more money as they get older but make less money once they retire. A linear effect, therefore, is incomplete since the picture should look more like a parabola. We account for this relationship by including age in the regression along with $\text{age}^2$. In general, if a theory explicitly calls for curvilinearity, then a squared term is required. Models in the social sciences very rarely call for cubic terms or for higher powers.

An excellent question about regression models is why some points are poorly modeled by the line. In Figure 2.17, Observation 4 is slightly more conservative than Observation 3, but Observation 4 has a much higher approval of Obama than Observation 3. Why does it make sense that movement in the conservative direction is associated with higher approval of Obama? There are two possible answers.

The first is that everything in this situation is random. That is, we’ve identified an underlying model and stripped away unimportant variation due to randomness. A particular day in April may be warmer, for example, than a particular day in June. But by looking at the data in their entirety, we clearly observe that June is warmer than April. Likewise, a slightly more conservative individual may randomly be more approving of Obama than a slightly more liberal individual, but the data as a whole tell us that more conservative individuals definitely approve of Obama less.

The second reason is more interesting and more troubling. It is very possible that we’ve missed some important part of the story. Maybe it is the case that while conservatives approve of Obama less, women approve of Obama more than men with the same ideology. If that is the case, and Observation 4 is a woman and Observation 3 is a man, then the unexplained variation in this part of the model is completely expected and is due to factors that are decidedly not random. In this case, the sex of the individual is an omitted variable. It is easy to imagine other omitted variables that may have an impact on Obama approval independently from ideology.
Individuals may approve of Obama more if they are from Obama’s home state of Illinois, if they agree with the steps Obama has taken on a particular issue, or if they have received a larger tax refund. If a linear model fails to include variables that are also important, it can potentially lead to inaccurate predictions and parameter estimates.⁸

The regression model that has only one independent variable is sometimes called a simple regression. To consider the possibility of additional variables, we expand the framework of a linear regression to explain the variation of a y variable with many x variables that are simultaneously included in the model. While many xs are allowed, a regression only ever has one intercept. A regression that includes more than one x variable is called a multiple regression. A multiple regression model with five x variables has the following equation:

$y_i = \alpha + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \beta_3 x_{3,i} + \beta_4 x_{4,i} + \beta_5 x_{5,i} + \epsilon_i.$

The model is the same as the single-variable regression model, only now there are five separate x variables, subscripted 1 through 5, and five slopes, also subscripted 1 through 5. The multiple regression model still attempts to find the values of $\alpha$, $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$ that minimize the SSR. Since there is still just one residual term for this model, the SSR is calculated in the same way as with the single-variable regression. By including more than one x, we look at each x variable’s impact on y after controlling for the other xs in the model.
8 Specifically, omitted variables are a serious problem for a linear regression when they are correlated with both the outcome and an independent variable. In this case, we cannot believe our estimate of the effect of the independent variable on the outcome because we cannot tell if X has an effect on Y or if a related but omitted variable Z is really driving this effect. For example, suppose that the real explanation of voters’ decisions is income and that income is highly correlated with ideology. If we do not include income in the regression, we will mistakenly ascribe the influence of income on voting to the influence of ideology instead.
For example, suppose a model of presidential approval considers an individual’s ideology, age, and income. The multiple regression considers the effect of ideology on approval within groups of individuals with the same age and income, the effect of age on approval within groups with the same ideology and income, and the effect of income on approval within groups with the same ideology and age. The mathematics that underlie the multiple regression model handle these multiple comparisons. Researchers do not need to run more than one model to derive all of this information, but they do need to interpret the slopes and intercept correctly, as described earlier in this section. The formulas for the slope and the intercept of the best-fit line of a simple regression are presented in Section 7.3.4. Multiple regression is discussed further in Section 9.3.
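The interpretation of a slope in a multiple regression can be sketched as follows. The parameter values, the units, and the function name `predict` are all hypothetical and chosen only for illustration; the point is that when age and income are held fixed, a one-unit change in ideology moves the predicted value by exactly the ideology slope.

```python
def predict(intercept, slopes, xs):
    """Predicted y from a multiple regression, leaving out the residual."""
    return intercept + sum(b * x for b, x in zip(slopes, xs))


# Hypothetical parameter values, invented for illustration only.
alpha = 2.0
betas = [-3.0, 0.05, 0.5]  # slopes for ideology, age, income (in $10,000s)

person = [1.0, 40.0, 5.0]
# Same age and income, but one unit more conservative.
more_conservative = [2.0, 40.0, 5.0]

change = predict(alpha, betas, more_conservative) - predict(alpha, betas, person)
print(change)  # equals the ideology slope, up to floating-point rounding
```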
Exercises 1. It is a well-known fact that when a zombie bites a human, the human turns into a zombie. Suppose a mad scientist creates one zombie and due to a tragic but inevitable failure of the lab’s security system, the zombie escapes and begins biting people. (a) Consider the process that turns a human into a zombie. We can think of this process as a function. What set comprises the domain of this function, and what set comprises the range of this function? (Hint: Remember that a set is a collection of objects that are often but not necessarily numbers.) (b) Suppose that each zombie bites two people every day, so that each day the zombie population triples. Write a function f(x) that expresses the zombie population after x days. (c) Using this function, how many zombies are there 28 days later?9 (d) Suppose the first zombie is created and wreaks havoc inside the United States. The rest of the world, recognizing the gravity of the situation, bands together to close the borders of the United States in order to ensure that no zombies leave the country. Unfortunately, this means that no humans can escape the country either. As of 2015, the population of the United States is 320,213,084 people. According to the function you specified in (b), how long would it take for everyone to become a zombie? Write your answer first in terms of logarithms, and then as an approximate number. (Hint: Use the change-of-base formula to convert your answer into either a natural log or a common log, which a computer or a calculator will be able to evaluate.) 2. Using the following functions, √ f(x) = x , h(z) = e−z/ ,
g(y) = y − y + , k(w) = ln(w),
find the following functions. Simplify as much as possible.
9 See
https://www.youtube.com/watch?v=c7ynwAgQlDQ.
(a) f(x ) (b) f− (x) (c) (f ◦ g)(y) (d) (g ◦ f)(x) (e) (f ◦ f)(x) (f) (h ◦ k)(w) (g) (k ◦ h)(z) 3. Consider the following function: ln(x − ) h(x) = √ . −x (a) In set builder notation, express the domain of h(x). (b) For what value of x is h(x) = ? (c) State whether true or false: The range of h(x) is the set of all real numbers, (−∞, ∞). Explain your answer. (Hint: This problem requires an external calculator of some sort. Try plugging in numbers that are very close to the end points of the domain.) 4. Create Venn diagrams that express the following set notational statements: (a) Ã ∩ B (b) A ∪ (B ∩ C̃) (c) A ∩ (B ∪ C̃) (d) (Ã ∩ B) ∪ (A ∩ C) 5. Provide a translation in plain language of the following set notational statements: ˜ where A is the set of all vegetables and B is the set of red foods. (a) A ∩ B, (b) Ã ∪ B, where A is the set of democratic countries and B is the set of economically underdeveloped countries. (c) [ , ) (d) ( ,∞) (e) {x ∈ Z | x ≥ } (f) {y ∈ R |
y
∈ Z}
6. Express the following statements using set builder notation:
(a) The set of real numbers greater than or equal to − and less than 4.
(b) The set of real numbers greater than 12.
(c) The set of integers that are divisible by 3.
(d) The set of real numbers that solve the equation x − x + = .
7. Consider a legislature with three members, named legislators A, B, and C. They are voting on spending bills that have two components: One part makes a proposal for social spending, and the other makes a proposal for defense spending. The status quo (denoted SQ in the figure below) is 2 for social spending and 2 for defense spending. Each legislator has an ideal point for social and defense spending. A prefers (1,1) for social and defense spending, respectively, B prefers (2,4), and C prefers (4,2). Each legislator will vote for a proposal that is closer to his or her ideal point than the status quo. As shown in the diagram, the set of proposals that a legislator will vote for is a circle, centered around the legislator’s ideal point, that passes through the status quo point. Let A be the set of proposals that legislator A prefers to the status quo, B be the set of proposals that legislator B prefers to the status quo, and C be the set of proposals that legislator C prefers to the status quo.
(a) The win set is the set of proposals that at least two out of the three legislators will vote for. Describe this set using set notation (intersections, unions, parentheses, etc.). (b) Copy the above graph, and shade in the win set. 8. Create graphs for the following set notational statements or functions: (a) ( , ] (b) [− , ) ∪ [ ,∞) (c) f(x) = x + x − √ (d) g(y) = y + (e) h(z) = ln(z + ) ew (f) k(w) = + ew
9. Create a function that meets the following criteria: (a) A linear function that passes through points (1,2) and (3,8) (b) A linear function with a y-intercept of − that passes through the point (3,1) (c) A linear function with a slope of − that passes through the point (− ,6) (d) A third-degree polynomial with roots at − , 1, and 3 (e) A fifth-degree polynomial whose only roots are − , 2, and 5 10. The article “Bargaining in Legislatures” by David P. Baron and John A. Ferejohn (1989) uses formal theory to describe how legislators act strategically to form agendas, pass legislation, and serve their districts.10 Baron and Ferejohn consider a legislature with three members: legislators 1, 2, and 3. The legislators are trying to decide how to divide a dollar between their districts. One of the three legislators is chosen randomly to make the first proposal. She proposes a distribution (x , x , x ), where each x is a monetary amount and the three xs cannot add to more than 1. The subscripts, 1, 2, and 3, denote the legislator who receives that portion of the dollar, so a proposal (.5, .25, .25) means that legislator 1 receives 50 cents, and legislators 2 and 3 each receive a quarter. The superscript 1 refers to the fact that this is the first offer made. If this proposal fails, another legislator will make a proposal denoted (x , x , x ). Baron and Ferejohn are trying to find the proposals that can be passed if all three legislators want to maximize their share of the dollar. They first note that any proposal should have the property that the three allocations add up to 1; otherwise they would be leaving money on the table. They define a feasible set of proposals, called A, which is defined as A ≡ {(x ,x ,x ) | x + x + x = }, the set of proposals such that the allocations add to 1. The ≡ sign means “equivalent to,” which for our purposes is the same as a regular equal sign. Suppose legislator 1 is chosen to make the first proposal. 
If this proposal fails, then the other two legislators have an expectation about what proposals they can make and pass in later rounds. For a legislator j, her expected allocation in future proposals is vj . But there is a penalty to everyone for failing to reach a decision right away. If the first proposal fails, the payoffs for everyone in the next round are multiplied by the discount factor δ, which is between 0 and 1. So if the legislators fail to pass a proposal in the first round, then they only earn (δx , δx , δx ) in the next round. On page 1193, Baron and Ferejohn are trying to find the set of proposals that legislator 1 can make that can pass the legislature unanimously. They call this set U, and they denote this set using set builder notation: U ≡ {(x , x ) | xj ≥ δvj , j = , , x + x ≤ } ∩ A. Your task in this problem is to translate this statement into plain English. Be clear and be precise about how every symbol in this statement is translated. 10 Baron,
D. P., & Ferejohn, J. A. (1989). Bargaining in legislatures. American Political Science Review, 83(4), 1181–1206.
11. Consider the following data:

Observation | Income | Education | Sex
1 | 59,800 | 12 | 1
2 | 85,800 | 15 | 0
3 | 64,900 | 13 | 1
4 | 51,200 | 8 | 0
5 | 55,200 | 11 | 1
6 | 100,100 | 18 | 0
7 | 71,300 | 14 | 1
8 | 90,300 | 16 | 0
9 | 75,600 | 15 | 1
10 | 80,100 | 14 | 0
The first variable, income, measures the annual income in dollars of the ten people surveyed. The second variable, education, is the number of years of formal education that each respondent has had. The last variable, sex, is 0 for men and 1 for women. (a) Create a scatterplot with education on the x-axis and income on the y-axis. After creating the scatterplot, draw your best guess for what the best-fit line through these data should be. (b) Draw the scatterplot from part (a) again, only this time use dots for the observations that are men and squares for the observations that are women. Draw your best guess for what the best-fit line through these data should be for just the men, and then for just the women. How do these two lines compare with one another? What does this comparison tell you about the relationship between sex and income after accounting for education? (c) Suppose that the best fitting linear regression model for these data is Incomei = 11,356 + 4,939 Educationi − 10,174 Sexi + ϵi . Provide a substantive interpretation of the constant, the coefficient on education, and the coefficient on sex. (d) Using the regression model in part (c), predict the income of a woman with 17 years of formal education.
3 Probability

Mathematics is the scientific language of certainty. But statistics is the mathematics of uncertainty. We use probabilities to represent exactly how uncertain we are. (Since we apply a certain language to uncertain situations, we are certain about how uncertain we are, if that makes sense.) In the social sciences, we are often, if not always, uncertain about the situations and phenomena we study, and statistics and probability help us deal with this uncertainty.
3.1 Events and Sample Spaces

Probability theory is an extension of set theory and functions, discussed in Chapter 2, and uses much of the same notation and terminology. A probability function looks like $P(A) = p$, where A is a set of possibilities and is called an event. The elements of A are outcomes, and p is a probability, where $0 \le p \le 1$. The closer p is to 1, the more certain we are that one of the outcomes in event A will occur. If $P(A) = 1$, then event A is certain, and if $P(A) = 0$, then event A is impossible.

What does a probability mean? A quick aside: There are two ways to think about probabilities. Suppose that I tell you that a particular event has a .35 probability of occurring. One way to interpret this statement is as a frequency. Frequentists may conceive that, in a hypothetical setting where we can rewind history and replay the course of things, the event will occur in 35 out of every 100 realities. To frequentists, randomness is part of the natural course of the universe. Some processes are simply random, and while we may be able to explain how some factors make an event more or less likely, some part of the causal process is always pure chance.

The other way to think about probability is as a measure of not frequency but belief. Bayesianists see the world as set but consider our perceptions of it to be variable. To Bayesianists, it makes no sense to think of probability as frequency because we cannot rewind history. Our reality is the only one we have, but we can use all the information that exists in this reality to update our beliefs about the likelihood of an event. A .35 probability means that we doubt that an event will occur but we are more open to the idea than we would be if the probability were .25, for example. If we observe new information that favors the event, we might update our beliefs and increase this probability.
The division between these two ways of interpreting probabilities has been exaggerated. Empirical methodologists sometimes even identify themselves as either “frequentists” or “Bayesianists.” But the truth is that the debate has more relevance for philosophy than for applied research. It will be useful to understand both ways of thinking about probability, and either way the mathematical techniques presented here are the same.
Every event is a subset of a sample space, which is the set of all possibilities. A sample space is analogous to the universal set described in Section 2.1. Here we denote the sample space as S. For all events A, the complement event, Ã, consists of everything in S that is not in A.

A sample space can be written in many ways, just so long as all possibilities are accounted for. It is a good idea to write the sample space in the way that best goes with the research question being considered. For example, suppose I roll a die. If I am interested in the probability of rolling a certain number, I can write the sample space as S = {1, 2, 3, 4, 5, 6}, but if I am interested in the probability of rolling an even number, I can write S = {even, odd}. For a more relevant example, suppose we are uncertain about the future outcome of an election between two candidates a and b. If we are just interested in who wins, the sample space is S = {a wins, b wins}. But if we are interested in the margin of victory rounded to the nearest percentage point, we can write the sample space as S = {a wins by 1, b wins by 1, a wins by 2, b wins by 2, . . . , a wins by 100, b wins by 100}. However you choose to write the sample space, be sure that it is exhaustive of all the possibilities, and then stick with that configuration throughout your analysis.

A set is discrete if we can count the elements, and continuous if we cannot count the elements. We can count discrete elements because there is nothing in between two elements for us to consider. We can count the integers (although we would be counting forever) because there is nothing to consider between elements 1 and 2, but we cannot count the real numbers because we would have to consider infinitely many numbers between 1 and 2. Therefore, the set of integers, Z, is a discrete set, and the set of real numbers, R, is a continuous set. If the sample space is discrete, the probability function is called a probability mass function (PMF). If the sample space is continuous, then the probability function is called a probability density function (PDF). PMFs and PDFs are discussed in much greater detail in Section 6.6.
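As a small illustration of these definitions, the die example can be written with Python's built-in `set` type. The variable names are chosen here for illustration and are not part of the text.

```python
# Sample space for one roll of a six-sided die.
S = {1, 2, 3, 4, 5, 6}

# An event is a subset of the sample space: here, rolling an even number.
A = {2, 4, 6}

# The complement event contains every outcome in S that is not in A.
A_complement = S - A

print(A <= S)                    # True when A is a subset of S
print(sorted(A_complement))
# An event and its complement together exhaust S and share no outcomes.
print((A | A_complement) == S, (A & A_complement) == set())
```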
3.2 Properties of Probability Functions

Let set A be an event made up of k mutually exclusive outcomes (meaning that if one outcome occurs, a different outcome cannot occur):

$A = \{a_1, a_2, \ldots, a_k\}.$

The probability of event A occurring is the sum of the probabilities of the individual outcomes:

$P(A) = P(a_1) + P(a_2) + \cdots + P(a_k) = \sum_{i=1}^{k} P(a_i).$

Since the sample space contains all possible outcomes, we know that the event of the sample space must occur with certainty:

$P(S) = 1.$

Likewise, the empty set by definition has no outcomes, so the probability of an empty event is 0:

$P(\emptyset) = 0.$

These last two statements simply say that we are certain that something will happen and that we are certain that it cannot be the case that nothing happens.

Example 1. You are playing Yahtzee. You’ve rolled four of your five dice, and you’ve gotten a 2, a 3, a 4, and a 5. You are going to roll your fifth and final die. What is the probability that you roll a straight? The event that results in rolling a straight is A = {roll a 1, roll a 6}, since then you end up rolling (1, 2, 3, 4, 5) or (2, 3, 4, 5, 6). The probability of one of these two outcomes occurring is equal to the sum of the probability of each outcome:

$P(A) = P(\{\text{roll a 1}, \text{roll a 6}\}) = P(\text{roll a 1}) + P(\text{roll a 6}) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}.$

The sample space consists of six possible outcomes, S = {roll a 1, roll a 2, roll a 3, roll a 4, roll a 5, roll a 6}, and since these are all of the possible outcomes, we are certain that something in the sample space will occur:

$P(S) = P(\text{roll a 1}) + P(\text{roll a 2}) + P(\text{roll a 3}) + P(\text{roll a 4}) + P(\text{roll a 5}) + P(\text{roll a 6}) = \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} + \frac{1}{6} = 1.$

Finally, an outcome of rolling nothing, in this case, is impossible. So the probability of the empty set is 0.
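The Yahtzee calculation can be checked with exact fractions. This is a minimal sketch that assumes a fair die, so each face has probability 1/6; the helper name `prob` is hypothetical.

```python
from fractions import Fraction

# Assumption: a fair die, so each of the six outcomes has probability 1/6.
S = {1, 2, 3, 4, 5, 6}
P = {outcome: Fraction(1, 6) for outcome in S}


def prob(event):
    """Probability of an event: the sum of its mutually exclusive outcomes."""
    return sum((P[outcome] for outcome in event), Fraction(0))


straight = {1, 6}            # the final die completes a straight
p_straight = prob(straight)
p_sample_space = prob(S)     # something in S must happen
p_empty = prob(set())        # the empty event cannot happen

print(p_straight, p_sample_space, p_empty)
```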
3.2.1 Equally Likely Outcomes

Suppose the sample space consists of k outcomes,

$S = \{s_1, s_2, \ldots, s_k\},$

and that every outcome in the sample space has an equal probability of occurring: $P(s_i) = P(s_j)$ for all $s_i, s_j \in S$. In other words, if two outcomes are possible, then these two outcomes are equally likely. In these cases, the probability is said to be distributed uniformly across the outcomes.¹ Lottery numbers, for example, are a uniformly distributed set of outcomes because every number is equally likely to be drawn. In the social sciences, we often aim to draw a sample uniformly from a population. For a survey of U.S. adults, every adult in the United States should be equally likely to be included in the sample.

When all outcomes are equally likely, we have a simple way to derive the probability of an event. Recall from Table 2.1 in Section 2.1 that the notation |A| refers to the number of elements in set A. The probability of event A is the number of outcomes in the event divided by the number of possible outcomes:

$P(A) = \frac{|A|}{|S|}.$

So the trick is counting the number of elements in the event and in the sample space. Sometimes the outcomes can be easily counted, but sometimes counting is a much more difficult problem. In fact, there is an entire literature in mathematics to deal with this kind of counting. Some advanced counting techniques are discussed in Section 3.3.

Example 1. According to the 2010 census, the U.S. population is 308,745,538, and the number of people of age 65 or over is 34,991,753. If a researcher were to choose one individual in the United States at random to interview, what is the probability that this person is 65 years or older? The event A contains every person who is 65 or older, so |A| = 34,991,753. The sample space contains the entire U.S. population, so |S| = 308,745,538. The probability of selecting one person of age 65 or older at random is therefore

$P(A) = \frac{|A|}{|S|} = \frac{34{,}991{,}753}{308{,}745{,}538} \approx .113.$
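The counting rule can be sketched directly. The function name `equally_likely_prob` is hypothetical; the two counts are the ones quoted in the example above, and the rule applies only under the assumption that every outcome is equally likely.

```python
from fractions import Fraction


def equally_likely_prob(event_size, sample_space_size):
    """P(A) = |A| / |S| when every outcome in S is equally likely."""
    return Fraction(event_size, sample_space_size)


# Counts quoted in Example 1.
age_65_plus = 34_991_753
population = 308_745_538

p = equally_likely_prob(age_65_plus, population)
print(round(float(p), 3))

# The same rule with small sets: an even roll of a fair die.
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
print(equally_likely_prob(len(A), len(S)))
```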
Example 2. Blackjack is a casino game in which each player competes against the dealer to see who can get the value of his or her cards closer to 21 without going over. Numerical cards are worth their numbers: 2s are worth 2, 3s are worth 3, and so on. Face cards (the jack, queen, and king) and the 10 are worth 10. Aces are worth either 1 or 11, whichever makes for a better hand.
1 Uniform distributions are not very common in nature. In the social sciences, for example, we cannot make a claim that things like public opinion, most demographics, national wealth, and so on, have outcomes that are all equally likely. Uniform distributions are most useful in simulations for generating random numbers and for surveys in which a sample is drawn uniformly from a population.
Initially, every player is dealt two cards, and if a player has a low value, he or she can choose to “hit” and receive another card. On a blackjack table, the dealer and several players all have cards on the table, and eight decks of cards are typically used. One of the dealer’s cards is dealt face down. Crafty players can use all this information to their advantage. Suppose that you have been dealt an 11, and eight people are playing and no one has yet hit. Given the cards on the table, what is the probability of hitting and getting a face card (or a 10) to make a 21? The probability of drawing a face card is
\[ P(F) = \frac{|F|}{|S|}, \]
where $|F|$ is the number of face cards left in the deck and $|S|$ is the number of cards left in the deck. Both of these values depend on the number of cards and face cards already on the table. The total number of cards in eight decks is $8 \times 52 = 416$. Subtract from that the number of cards on the table, which is 16 for eight players plus 1 visible card from the dealer. So $|S| = 416 - 17 = 399$ cards are left in the deck. How many of these cards are face cards? In each deck, there are 16 face cards (the 10, jack, queen, and king from each of the four suits). In eight decks, then, there are $8 \times 16 = 128$ face cards. Let $f$ be the number of face cards on the table. $f$ may be as low as 0 or as high as 15 (since neither of your own cards is a face card). The number of face cards left is $|F| = 128 - f$, and the probability of drawing a face card is
\[ P(F) = \frac{128 - f}{399}. \]
Plugging in each possible number of face cards, the probabilities are as follows:
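 f    P(f)               f    P(f)
 0    128/399 = .321     8    120/399 = .301
 1    127/399 = .318     9    119/399 = .298
 2    126/399 = .316    10    118/399 = .296
 3    125/399 = .313    11    117/399 = .293
 4    124/399 = .311    12    116/399 = .291
 5    123/399 = .308    13    115/399 = .288
 6    122/399 = .306    14    114/399 = .286
 7    121/399 = .303    15    113/399 = .283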
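The table can be reproduced with a few lines of Python. This is only a sketch under the example's own assumptions (eight 52-card decks, 16 ten-valued cards per deck, 17 cards visible on the table); the constant and function names are illustrative and do not come from the text.

```python
DECKS = 8                 # eight decks, as in the example
CARDS_PER_DECK = 52       # standard deck size
TENS_PER_DECK = 16        # 10, jack, queen, king in each of the four suits
VISIBLE_CARDS = 17        # 16 player cards plus the dealer's face-up card


def prob_ten_valued(f):
    """P(F) when f ten-valued cards are already visible on the table."""
    remaining_tens = DECKS * TENS_PER_DECK - f                 # |F| = 128 - f
    remaining_cards = DECKS * CARDS_PER_DECK - VISIBLE_CARDS   # |S| = 399
    return remaining_tens / remaining_cards


# f ranges from 0 to 15 because your own two cards are not ten-valued.
for f in range(16):
    print(f"f = {f:2d}   P(F) = {128 - f}/399 = {prob_ten_valued(f):.3f}")
```

The probability falls by only about .0025 for each additional ten-valued card on the table, which is why the entries change so slowly.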
3.2.2 Unions of Events

As described in Section 2.3, there is an analogy between the set notations denoting union and intersection and the logical concepts of “or” and “and.” The union of events $A$ and $B$ is denoted $A \cup B$, and the probability $P(A \cup B)$ means “the probability of event $A$ or event $B$ occurring.” Likewise, the intersection of two events is denoted $A \cap B$, and the probability $P(A \cap B)$ means “the probability of event $A$ and event $B$ both occurring.” The probability of the union of two events is the probability of the first event plus the probability of the second event minus the overlap between the two:
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B). \]
We have to subtract the overlap once because we count it twice when we add the probabilities of the two events together.

Figure 3.1: An Illustration of the Union of Overlapping Events. [Venn diagrams with panels “Event A Only,” “Event B Only,” and “The Union of A and B”; each panel shows two overlapping circles labeled Event A and Event B.]
In Figure 3.1, two events A and B are represented as sets in a Venn diagram. Notice that if we add the shaded area for event A and event B, we have double-counted the area of the intersection. So to accurately measure the shaded area, we have to subtract the area of the intersection. If A and B are mutually exclusive, however, then they share no outcomes and
their intersection is empty, so $P(A \cap B) = 0$ and $P(A \cup B) = P(A) + P(B)$.

Example 1. Suppose there are 218 sunny days a year in Charlottesville and 172 days in which the temperature is above 75 degrees. Of these, 115 days are both sunny and above 75 degrees. What is the probability that a randomly chosen day is either sunny or above 75 degrees? Let event $A$ be that a day is sunny, and let event $B$ be that a day is above 75 degrees. The size of the sample space is 365 (supposing that this isn’t a leap year). The probability that the day is sunny is
\[ P(A) = \frac{|A|}{|S|} = \frac{218}{365} = .597, \]
and the probability that the day is warmer than 75 degrees is
\[ P(B) = \frac{|B|}{|S|} = \frac{172}{365} = .471. \]
It is a mistake to add these two probabilities together to derive the probability of the day being either sunny or warm. In other words,
\[ P(A \cup B) \neq P(A) + P(B) = .597 + .471 = 1.068. \]
This probability is impossible because it is greater than 1. Specifically, it is too high because we’ve double-counted the days that are both sunny and warm. The probability that a day is both sunny and warm is
\[ P(A \cap B) = \frac{115}{365} = .315. \]
So the correct probability of the day being sunny or warm is
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) = .597 + .471 - .315 = .753. \]
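The arithmetic in this example is easy to check in Python. The sketch below uses only the day counts stated in the example; the variable names are illustrative.

```python
DAYS_IN_YEAR = 365        # non-leap year, as the example assumes
sunny_days = 218
warm_days = 172
sunny_and_warm_days = 115

p_sunny = sunny_days / DAYS_IN_YEAR
p_warm = warm_days / DAYS_IN_YEAR
p_both = sunny_and_warm_days / DAYS_IN_YEAR

# Adding the two probabilities double-counts the overlap and exceeds 1.
naive_sum = p_sunny + p_warm

# Subtracting the intersection once gives the probability of the union.
p_union = p_sunny + p_warm - p_both

print(f"naive sum = {naive_sum:.3f}, P(A or B) = {p_union:.3f}")
```

Working with the raw counts gives the same answer: 218 + 172 − 115 = 275 days are sunny or warm, and 275/365 is the union probability.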
3.2.3 Independent Events

Many methods in the social sciences are built on assumptions that two particular events are independent or are conditionally independent from each other. These two terms deserve explicit definitions.

Definition: Independence
Two events A and B are independent if the probability of one does not depend on or vary with the probability of the other.

Definition: Conditional independence
Two events A and B are conditionally independent if the probability of one depends on or varies with the probability of the other only through the coincidence that both events are related to a third event C. If C is accounted for, then A and B are independent.
There isn’t always an explicit test to prove that two events are independent. Often, researchers must consider whether the assumption of independence or conditional independence is reasonable. Consider the following examples:

• Events: one roll of a die; a second roll of the same die. These events are independent. The die does not have a memory of its past rolls. If a 6 comes up on the first roll, then a 6 is just as likely to occur on the second roll as it would be had 6 not been rolled the first time.

• Events: a country has a nondemocratic government; the country violates the human rights of members of its population. Most researchers would consider these events to be dependent since some feature of nondemocratic governance often allows the government to commit human rights violations.

• Events: sales of ice cream increase; sales of sunblock increase. These two events are conditionally independent given warm, sunny weather. They are not strictly dependent because there’s no mechanism that causes people who are eating ice cream to want to cover their skin in protective lotion, nor is there a chemical effect of sunblock that causes people to crave ice cream.

• Events: a Democrat wins the presidency; Democrats win a majority of seats in Congress. These two events are not independent because if one occurs, the other event is more likely to occur. The events, however, might be considered conditionally independent, where the intervening event is that the electorate tends to prefer Democrats in this particular election.

One important example of how social scientists assume independence is linear regression, discussed in Section 2.5.3. Linear regressions make the assumption that the residuals are independent from one another. Whether one residual is big or small, above the best-fit line or below it, must not affect whether another residual is big or small, or positive or negative.
Sometimes this assumption is a suspicious one, especially if the data describe a process that varies over time. There exist tests of nonindependence of the residuals, and a few methods to correct a linear regression when the residuals are dependent. These topics will be covered in a detailed course on regression methodology.

If two events $A$ and $B$ are independent, then the probability of their intersection is the product of the probabilities for the two events:
\[ P(A \cap B) = P(A)P(B). \]
This formula simplifies a number of situations in which probabilistic models are used to describe phenomena in the social sciences. As a researcher, you should find this assumption appealing, but use caution, and consider whether the assumption is substantively defensible.

Example 1. What is the probability of rolling a yahtzee (all five dice with the same value) in one roll in Yahtzee? There are six ways to roll a yahtzee. Let $Y$ be the event of rolling a yahtzee, and let $Y_1$ be the event in which all five dice are 1, $Y_2$ be the event in which all five dice are 2, and so on. A yahtzee occurs when one of these six events occurs. Therefore, we are interested in deriving
\[ P(Y) = P(Y_1 \cup Y_2 \cup Y_3 \cup Y_4 \cup Y_5 \cup Y_6). \]
The events $Y_1$, $Y_2$, $Y_3$, $Y_4$, $Y_5$, and $Y_6$ are mutually exclusive since there’s no roll of the five dice that can be all 1s and all 2s, for instance. Using the formula for mutually exclusive events from Section 3.2.2, we can rewrite the probability as
\[ P(Y) = P(Y_1) + P(Y_2) + P(Y_3) + P(Y_4) + P(Y_5) + P(Y_6) = 6P(Y_1), \]
since these six events are equally likely. Let $d_1$ be the event that die 1 is a 1, $d_2$ be the event that die 2 is a 1, and so on. The event $Y_1$ occurs only when $d_1$, $d_2$, $d_3$, $d_4$, and $d_5$ all occur. So
\[ P(Y_1) = P(d_1 \cap d_2 \cap d_3 \cap d_4 \cap d_5). \]
Since $d_1$, $d_2$, $d_3$, $d_4$, and $d_5$ are individual rolls of a die, they are independent. So we can simplify the probability as
\[ P(Y_1) = P(d_1) \times P(d_2) \times P(d_3) \times P(d_4) \times P(d_5). \]
Since each of $P(d_1)$, $P(d_2)$, $P(d_3)$, $P(d_4)$, and $P(d_5)$ equals $1/6$, the probability of a yahtzee of all 1s is
\[ P(Y_1) = \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} \times \frac{1}{6} = \frac{1}{6^5} = 6^{-5}. \]
Plugging this probability into the equation for $P(Y)$, we get
\[ P(Y) = 6P(Y_1) = 6 \times 6^{-5} = 6^{-4} \approx .00077, \]
which means that we expect about one yahtzee for every $6^4 = 1{,}296 \approx 1{,}300$ rolls of the five dice.
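As a check on the yahtzee calculation, the following Python sketch computes the probability from the independence formula and then confirms it by listing every possible roll of five dice. The variable names are illustrative; the brute-force enumeration is an added check, not a method used in the text.

```python
from fractions import Fraction
from itertools import product

# Independence: multiply the five single-die probabilities of rolling a 1.
p_all_ones = Fraction(1, 6) ** 5

# Six mutually exclusive yahtzees (all 1s, all 2s, ..., all 6s), equally likely.
p_yahtzee = 6 * p_all_ones

# Brute-force check: enumerate all 6^5 equally likely rolls of five dice.
all_rolls = list(product(range(1, 7), repeat=5))
yahtzee_rolls = [roll for roll in all_rolls if len(set(roll)) == 1]
p_by_counting = Fraction(len(yahtzee_rolls), len(all_rolls))

print(p_yahtzee, p_by_counting, float(p_yahtzee))
```

Both routes give 6 favorable outcomes out of 7,776, which reduces to 1 in 1,296.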
Casinos, incidentally, make quite a lot of money off of individuals’ lack of intuition about the independence of events. Roulette, for example, is a game in which a ball is placed onto a spinning wheel and gamblers place bets on the spot where the ball ends up. Most roulette tables also have an electronic board that lists the results of the last several spins. A gambler might look at the board, see some pattern, and make a bet using the idea that the next spin somehow depends on the previous ones. If all these previous bets landed on black numbers, the gambler might bet on red because red is “due.” But the roulette wheel itself has no memory of what the previous results had been. Each spin is an independent event, but imagining that there is some dependence encourages the suckers. The same faulty logic applies to slot machines, which some people imagine to be hot or cold, or to dealers who supposedly cause good luck or bad luck. In fact, the more mathematics you study, the less likely you will be to enter a casino.2
2 Studying math and choosing not to gamble are not independent events.

3.2.4 Complement Events

For any event $A$, we know that $A$ and its complement $\tilde{A}$ are mutually exclusive. Furthermore, if $A$ does not occur, then $\tilde{A}$ by definition must occur. Therefore, we know that
\[ S = A \cup \tilde{A}, \]
which means that
\[ P(S) = P(A \cup \tilde{A}) = 1, \]
and therefore
\[ P(A \cup \tilde{A}) = P(A) + P(\tilde{A}) - P(A \cap \tilde{A}) = P(A) + P(\tilde{A}) = 1. \]
Subtracting $P(\tilde{A})$ from both sides of the last equation, we derive the following important relationship between the probability of an event and its complement:
\[ P(A) = 1 - P(\tilde{A}). \]
Many questions involve finding the probability of some event. Often, it is easier to find the probability of the complement event and to subtract that probability from 1. If a problem uses the phrase “at least,” that is a clue that complements may be helpful.

Example 1. Suppose that an international treaty between 30 countries succeeds only if every country that signs the treaty fulfills its part of the agreement. Suppose further that each country has a .01 probability of defecting from the agreement and that each country makes this decision independently of the other countries. What is the probability that the treaty fails? The question can be rephrased to “What is the probability that at least one country defects from the agreement?” There are many ways in which at least one country can defect: Only one country can defect, two countries can defect, three countries can defect, and so on. If we calculate the probability of defection directly, we have to consider each of these cases and the calculation becomes very cumbersome. An easier way to calculate the probability is to consider the complement event. If $A$ is the event that at least one country defects, then $\tilde{A}$ is the event that no countries defect and the treaty succeeds, and
\[ P(A) = 1 - P(\tilde{A}). \]
$\tilde{A}$ is the intersection of the events in which each country individually fulfills its commitment. Each country has a .99 probability of fulfilling the agreement, and since these decisions are independent, this probability is
\[ P(\tilde{A}) = .99^{30} \approx .740. \]
The probability that the treaty fails is then
\[ P(A) = 1 - .740 = .260. \]
Example 2. Consider a bright undergraduate senior who is applying to graduate schools. She applies to 10 competitive programs and estimates only a .1 probability for each school of being accepted. This low estimate may cause many people to doubt their prospects, but she does a quick calculation. Assuming acceptance at each school is independent from acceptance at the others, the probability of getting in somewhere is the complement event of getting in nowhere. Rather than calculating the probability of getting into 1 school, 2 schools, 3 schools, and so on, it is easier to calculate the probability of the complement. The probability of rejection at each school is .9, but the probability of rejection at all 10 schools is $.9^{10} \approx .349$. Therefore, the probability of getting in somewhere is $1 - .349 = .651$, and things are looking up.
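Both complement examples follow the same pattern, so one small Python function covers them. This is a sketch that assumes, as the examples do, that the individual events are independent and share the same probability; the function name is hypothetical.

```python
def prob_at_least_one(p_each, n):
    """P(at least one of n independent events occurs), each with probability p_each.

    Uses the complement rule: 1 - P(none of the n events occurs).
    """
    return 1 - (1 - p_each) ** n


# Treaty example: 30 countries, each defecting with probability .01.
p_treaty_fails = prob_at_least_one(0.01, 30)

# Graduate school example: 10 programs, each accepting with probability .1.
p_admitted_somewhere = prob_at_least_one(0.1, 10)

print(f"{p_treaty_fails:.3f} {p_admitted_somewhere:.3f}")
```

The direct route would require summing the probabilities of exactly 1, exactly 2, and so on up to n occurrences; the complement replaces that sum with a single power.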
3.3 Counting Theory

This section refers specifically to the formula for the probability of events with equally likely outcomes, listed in Section 3.2.1:
\[ P(A) = \frac{|A|}{|S|}, \]
where $|A|$ is the number of outcomes in the event, and $|S|$ is the total number of possible outcomes. For these problems, counting the number of outcomes in each event constitutes most of the work. Several techniques presented below are helpful for counting large numbers of objects and solving these problems.
3.3.1 Multiplication

When an event takes place in several stages, we can multiply the number of ways each stage can occur to find the sizes of the event and the sample space. To find the size of the sample space, multiply the number of possibilities in each stage together. To find the size of the event, multiply the number of possibilities with the desired characteristics in each stage together.

Example 1. A committee must consist of a governor, a senator, and a Supreme Court justice. Committee members are drawn out of a hat (so that all outcomes are equally likely). What is the probability that everyone on the committee represents the 9th circuit (California, Oregon, Washington)?

• First, count the number of elements in the sample space. There are 100 possible senators, 50 governors, and nine justices.
\[ |S| = 100 \times 50 \times 9 = 45{,}000. \]

• Next, count the number of elements in the event. There are 6 senators, 3 governors, and one justice who represent the 9th circuit.
\[ |A| = 6 \times 3 \times 1 = 18. \]

• So now we can obtain $P(A)$:
\[ P(A) = \frac{|A|}{|S|} = \frac{18}{45{,}000} = .0004. \]
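The multiplication rule translates directly into code: multiply the stage counts for the sample space, multiply the stage counts for the event, and divide. The Python sketch below uses the committee example's numbers; the dictionary names are illustrative.

```python
from fractions import Fraction
from math import prod

# Number of possibilities at each stage of drawing the committee.
sample_space_stages = {"senators": 100, "governors": 50, "justices": 9}

# Number of possibilities at each stage that fit the event, as stated in the example.
event_stages = {"senators": 6, "governors": 3, "justices": 1}

size_S = prod(sample_space_stages.values())   # |S|
size_A = prod(event_stages.values())          # |A|
p_A = Fraction(size_A, size_S)

print(size_S, size_A, float(p_A))
```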
3.3.2 Factorials

The factorial of a positive integer is the product of all positive integers less than or equal to that number. We denote the factorial of a number with an exclamation point. Formally,
\[ n! = \prod_{i=1}^{n} i = n \times (n-1) \times (n-2) \times \cdots \times 2 \times 1, \]
where $\prod$ is the product sign from Section 1.6, which is similar to a summation, only the terms are multiplied instead of added. So, for example,
\[ 6! = 6 \times 5 \times 4 \times 3 \times 2 \times 1 = 720. \]
Factorials are useful for counting problems in which every member of a group is placed in some order. If 10 people are going to stand in a line, then there are 10 people who can stand first in line, then 9 who can be second, 8 who can be third, and so on, until there is only 1 person who can be last. The number of possible lineups then is $10! = 3{,}628{,}800$ since all the stages are multiplied together. By convention, factorials are defined only for nonnegative integers.3 The factorial of 0 is defined to be 1, so that $0! = 1$.

Example 1. In a party list election, the first candidate on a party’s list is the first one awarded a seat in parliament. One particular party cannot choose between its best seven members, named candidates A through G, so it ranks them randomly. What is the probability that candidate F is listed first and candidate G is listed second?

• The number of elements in the sample space is $7!$, since we are considering the number of orderings of seven candidates.

• To find the number of outcomes in the event, set F to the first spot and G to the second spot. Then there are five ways to choose the third spot, four for the fourth, and so on. So there are $5!$ elements in the event.

• The probability is
\[ P(A) = \frac{|A|}{|S|} = \frac{5!}{7!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1} = \frac{1}{7 \times 6} = \frac{1}{42} \approx .024. \]
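With only seven candidates, the party list example is small enough to verify by listing every ordering. The Python sketch below does so and compares the counts with the factorial formulas; the names are illustrative, and the enumeration is an added check rather than part of the text's method.

```python
from fractions import Fraction
from itertools import permutations
from math import factorial

candidates = "ABCDEFG"

# Sample space: every possible ranking of the seven candidates.
orderings = list(permutations(candidates))

# Event: F is listed first and G is listed second.
favorable = [o for o in orderings if o[0] == "F" and o[1] == "G"]

p_event = Fraction(len(favorable), len(orderings))

print(len(orderings), len(favorable), p_event)
```

The enumeration finds 5,040 orderings in all and 120 with F first and G second, matching 7! and 5!.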
3.3.3 Combinations and Permutations

Combinations and permutations count the number of unique small groups of size $r$ that can be drawn out of large groups of size $n$. Combinations, which are also called binomial coefficients, are denoted $_nC_r$ or $\binom{n}{r}$ and are pronounced “n choose r.” Permutations are denoted $_nP_r$. The difference between the two operations is that combinations suppose that order does not matter inside the small group and permutations suppose that order does matter. For example, how many unique small groups of two are there in a pool of three? If order doesn’t matter, there are three ways (1 and 2, 1 and 3, 2 and 3), so the combination should be equal to 3. If order does matter, there are six ways (1 and 2, 1 and 3, 2 and 3, 2 and 1, 3 and 1, 3 and 2), so the permutation should be equal to 6. Mathematically, a combination is defined as
\[ \binom{n}{r} = \frac{n!}{r!\,(n-r)!}. \]
A permutation is defined as
\[ _nP_r = \frac{n!}{(n-r)!}. \]
3 Number theorists have developed a “gamma function” to generalize factorials to all real numbers, but that topic is well beyond the scope of this discussion.
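Simply plug in $n$ (the size of the large group) and $r$ (the size of the small group). For the above example, we can set $n = 3$ and $r = 2$ and compute
\[ \binom{3}{2} = \frac{3!}{2!\,(3-2)!} = \frac{3 \times 2 \times 1}{(2 \times 1)(1)} = \frac{6}{2} = 3, \]
and
\[ _3P_2 = \frac{3!}{(3-2)!} = \frac{6}{1} = 6. \]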
Example 1. What is the probability of being dealt a royal flush?

• The event space consists of four elements, since there are four possible royal flushes (♣, ♢, ♡, ♠).

• The sample space is the number of ways 5 cards can be drawn out of 52. Order doesn’t matter, so we use a combination.
\[ |S| = \binom{52}{5} = \frac{52!}{(5!)(47!)} = 2{,}598{,}960. \]

• So the probability is
\[ P(A) = \frac{|A|}{|S|} = \frac{4}{2{,}598{,}960} \approx .0000015. \]
I think there’s a better chance of getting struck by lightning.
Example 2. The United Nations (UN) intends to appoint a task force on human rights by randomly choosing 4 member nations to serve on the committee. There are 192 member nations of the UN. The first nation chosen will be the chair of the committee, and the other nations will rank in the order they are chosen. Higher-ranking nations can veto the proposals of lower-ranking nations. What is the probability that China and the United States are both on the committee and that China can veto the United States?
• The sample space is the number of ways to choose 4 nations out of 192, where order matters. We use a permutation.
\[ _{192}P_4 = \frac{192!}{(192-4)!} = \frac{192!}{188!} = 1{,}316{,}891{,}520. \]

• There are six ways for China to outrank the United States on the committee: China as chair and the United States ranking 2, 3, or 4; China ranked 2 and the United States ranked 3 or 4; and China ranked 3 and the United States ranked 4. The number of committees that satisfy each of these six situations is $_{190}P_2$ since we are now choosing 2 ordered members from the remaining 190.

• The event space, therefore, is
\[ 6\,(_{190}P_2) = 6\left(\frac{190!}{188!}\right) = 215{,}460. \]

• The probability of the event, therefore, is
\[ P(A) = \frac{|A|}{|S|} = \frac{215{,}460}{1{,}316{,}891{,}520} \approx .00016. \]
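Python's standard library provides `math.comb` and `math.perm` (available from Python 3.8), which implement the combination and permutation formulas above. The sketch below reruns the three calculations from this section; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb, perm

# Pool of three, groups of two: order ignored versus order respected.
print(comb(3, 2), perm(3, 2))

# Royal flush: 4 favorable hands out of all unordered 5-card hands.
hands = comb(52, 5)
p_royal_flush = Fraction(4, hands)

# UN task force: ordered committees of 4 drawn from 192 nations.
committees = perm(192, 4)
# Six (China, United States) rank pairs with China ranked higher, each
# completed by 2 ordered members from the remaining 190 nations.
favorable_committees = 6 * perm(190, 2)
p_china_outranks_us = Fraction(favorable_committees, committees)

print(hands, float(p_royal_flush))
print(committees, favorable_committees, float(p_china_outranks_us))
```

Keeping the results as `Fraction` objects shows that the task force probability reduces exactly to 1/6,112.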
3.4 Sampling Problems

Another version of the problems discussed above concerns certain types of events within the sample space. Very commonly, a question will ask for the probability of a certain number of outcomes of one type and outcomes of another. In the social sciences, these problems occur often in survey methodology and sampling, where researchers want to make sure that a random sample of respondents includes enough people from different demographic groups. There are two ways to take a random sample from a population: sampling without replacement and sampling with replacement. Sampling without replacement assumes that once an outcome occurs, such as the selection of a particular respondent, then there is no chance that it can occur again. If we take a ball out of a box and leave it on the table, then we can’t take that ball out of the box again. Sampling with replacement makes no such assumption; we put the ball back into the box.4 In this section, we will consider problems that sample from a population and consider how groups within a population are represented in the sample.
4 Sampling with replacement is the basis of an advanced statistical method called bootstrapping. The idea is that if we take many small samples with replacement from a larger population and calculate a statistic for each sample, then the mean and standard deviation of the sample statistics will converge to the value and standard error of the statistic for the population.

3.4.1 Sampling Without Replacement

The U.S. Senate has 100 members. In 2013, 46 of these senators were Republicans, 52 were Democrats, and 2 were Independents. The two Independents both caucus with the Democrats, so for all intents and purposes, there are 54 Democrats and 46 Republicans. Suppose members are chosen randomly for a 10-person emergency committee. What is the probability that 5 members are Democrats and 5 are Republicans?

For problems like this one, it helps to think about different kinds of outcomes. In mathematics, one conventional way to differentiate between outcomes is to call some of them “good” outcomes and the others “bad” outcomes. These labels mean nothing; they only serve to make the divisions between the outcomes clear. We can reverse the labels, and none of the math changes. Suppose that there are $G$ good outcomes in the sample space and $B$ bad ones. Since all outcomes in the sample space are either good or bad, $|S| = G + B$. When sampling without replacement, the probability that the sample contains $g$ good elements and $b$ bad ones is
\[ P(g \text{ good}, b \text{ bad}) = \frac{\binom{G}{g}\binom{B}{b}}{\binom{G+B}{g+b}}. \]
For the Senate problem described above, the probability of choosing 5 Democrats and 5 Republicans is the number of ways to choose 5 Democrats out of the 54 Senate Democrats times the number of ways to choose 5 Republicans out of the 46 Senate Republicans, all over the number of ways to choose 10 Senators out of the 100 members. Plugging these numbers into the formula gives us an answer:
\[ P(5 \text{ Democrats}, 5 \text{ Republicans}) = \frac{\binom{54}{5}\binom{46}{5}}{\binom{100}{10}} = \frac{3{,}162{,}510 \times 1{,}370{,}754}{17{,}310{,}309{,}456{,}440} \approx .250. \]
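The formula for sampling without replacement can be sketched in Python with `math.comb`. The function name and argument order below are hypothetical; the inputs are the Senate example's 54 Democrats and 46 Republicans.

```python
from math import comb


def prob_without_replacement(G, B, g, b):
    """P(g good and b bad) when g + b items are drawn without replacement
    from a population of G good and B bad items."""
    return comb(G, g) * comb(B, b) / comb(G + B, g + b)


# Senate example: 5 Democrats and 5 Republicans on a 10-person committee.
p_five_five = prob_without_replacement(54, 46, 5, 5)
print(f"{p_five_five:.3f}")

# Sanity check: the probabilities of every possible party split add up to 1.
total = sum(prob_without_replacement(54, 46, g, 10 - g) for g in range(11))
print(f"{total:.6f}")
```

Relabeling Republicans as "good" and Democrats as "bad" leaves the answer unchanged, as the text notes, because the two binomial coefficients in the numerator simply swap places.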
3.4.2 Sampling With Replacement

Suppose that the Senate committee discussed in the previous section ends up with six Democrats and four Republicans. Every day for a week they draw straws to see who will be the chair of the committee, to set the agenda for committee votes and to report to the Senate. Everyone on the committee is given a chance to be the chairperson regardless of whether they have served before. What is the probability that the Republicans hold the chair for 4 days and the Democrats hold it for 3 days?

Since this situation allows the same outcome to occur more than once (one person can be the chair several times), the sample drawn is with replacement. Consider just Day 1 for a moment. The probability of a Democrat chair is $6/10$, and the probability of a Republican chair is $4/10$. On Day 2, these probabilities are exactly the same since everyone, including the Day 1 chair, has an equal shot. The same applies for Days 3 through 7. One way that the Republicans hold the chair for 4 days and the Democrats hold it for 3 days is if Democrats serve as chairs the first 3 days, followed by 4 days of Republican chairs. The probability of this event is simple to calculate:
\[ \frac{6}{10} \times \frac{6}{10} \times \frac{6}{10} \times \frac{4}{10} \times \frac{4}{10} \times \frac{4}{10} \times \frac{4}{10} \approx .0055. \]
This probability seems very low, but note that it is only one way in which the Republicans and Democrats can split the chair 4 to 3. What if the Republicans held the chair before the Democrats? More likely, what if they were more mixed up? The number of ways to split 4 to 3 is equal to the number of ways to choose 4 out of 7.5 The probability of a particular configuration of the sample is the individual probabilities for each stage multiplied together times the number of ways to rearrange these probabilities. Therefore, the probability of splitting 4 to 3 is actually
\[ \binom{7}{4} \times \frac{6}{10} \times \frac{6}{10} \times \frac{6}{10} \times \frac{4}{10} \times \frac{4}{10} \times \frac{4}{10} \times \frac{4}{10} = 35 \times .00553 \approx .194, \]
which seems more reasonable.

5 Or, equivalently, the number of ways to choose 3 out of 7. These quantities are precisely equal. In general, binomial coefficients have the property that $\binom{n}{k} = \binom{n}{n-k}$, so $\binom{7}{4} = \binom{7}{7-4} = \binom{7}{3}$.

These probabilities follow the binomial distribution, described below. In the formula, let $n$ be the size of the sample to be drawn, and let $k$ be the desired number of a given type in the sample. The difference $n - k$ is the number of the other type in the sample. Also, let $p$ be the probability in one stage of choosing an outcome of the desired type. The probability of choosing the other type is, therefore, $1 - p$. The general formula for sampling with replacement is
\[ P(k \text{ good}) = \binom{n}{k} p^k (1-p)^{n-k}. \]
For the above example, $n = 7$. Considering the Republicans, we set $k = 4$, which means that $p = .4$. Plugging in the numbers, we have
\[ P(4 \text{ Republicans}) = \binom{7}{4} \times .4^4 \times .6^3 \approx .194. \]
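The formula for sampling with replacement can be written as a short Python function. This is a sketch with hypothetical names; the inputs come from the committee example, where a Republican chairs on any given day with probability 4/10.

```python
from math import comb


def binomial_prob(k, n, p):
    """P(exactly k outcomes of the desired type in n independent draws),
    where each draw yields the desired type with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# One specific ordering: three Democratic chairs, then four Republican chairs.
p_one_ordering = 0.6**3 * 0.4**4

# All orderings with four Republican days out of seven.
p_four_republican_days = binomial_prob(4, 7, 0.4)

print(f"{p_one_ordering:.4f} {p_four_republican_days:.3f}")
```

The second value is exactly 35 times the first, since `comb(7, 4)` counts the distinct ways to arrange four Republican days within the week.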
3.5 Conditional Probability

A conditional probability is the probability of some event $B$ given that some other event $A$ has already occurred.

Definition: Conditional probability
The probability of an event $B$ conditional on an event $A$ is denoted $P(B \mid A)$ and is pronounced “the probability of B given A.” Mathematically, a conditional probability is defined as follows:
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)}. \]

Multiplying both sides of the definition of conditional probability by $P(A)$ gives another important version of this definition:
\[ P(A \cap B) = P(B \mid A)P(A). \]
Notice that if $A$ and $B$ are independent, then $P(A \cap B) = P(A)P(B)$, so
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A)P(B)}{P(A)} = P(B). \]
In other words, if $A$ and $B$ are independent, information given about $A$ tells us nothing new about the probability of $B$.

Conditional probability works by reducing the size of the sample space as we learn more about what has already happened and what may still happen. “Given A” means that we treat $A$ as if it has already happened. We only consider the portions of the event and sample spaces that are contained within $A$. When we condition on $A$, we narrow down reality, so we know that everything that will happen from now on must be inside $A$. Conceptually, imagine a Venn diagram with sets $A$ and $B$ intersecting in a larger box $S$. Now imagine zooming in on $A$ so that you can only see up to the borders of $A$. This zoomed-in picture is the new situation.

We can think of any set $B$ as being broken into two parts: (1) the part of $B$ that intersects another set named $A$ and (2) the part of $B$ that does not intersect $A$. If $A$ and $B$ are mutually exclusive, then they do not intersect; the first part of this division of $B$ is empty, and the second part contains the entirety of $B$. But if $A$ and $B$ intersect, $B$ is divided into two distinct parts. For any sets $A$ and $B$, the following property is true:
\[ B = (B \cap A) \cup (B \cap \tilde{A}). \]
That is, $B$ is the union of the part of $B$ that intersects $A$ and the part of $B$ that intersects the complement of $A$. Since $A$ and $\tilde{A}$ are mutually exclusive, we can apply this property to probabilities:
\[ P(B) = P(B \cap A) + P(B \cap \tilde{A}). \]
Using the definition of conditional probability, we can rewrite this property:
\[ P(B) = P(B \mid A)P(A) + P(B \mid \tilde{A})P(\tilde{A}). \]
We can generalize this property for any partition of the sample space.

Definition: Partition
A partition is a group of mutually exclusive sets $A_1, A_2, \ldots, A_k$ with the property that
\[ A_1 \cup A_2 \cup \cdots \cup A_k = S. \]

Think of the sample space as a jigsaw puzzle put together. Then a partition consists of the pieces of the puzzle: The pieces do not overlap, but put together they create the sample space. The simplest example of a partition is two pieces, where one piece is a set and the other piece is the complement of that set. If $A_1 \cup A_2 \cup \cdots \cup A_k$ is a partition of $S$, then
\[ P(B) = P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + \cdots + P(B \mid A_k)P(A_k). \]
For example, in Figure 3.2, a set $B$ is considered in a sample space $S$, where $S$ is partitioned into seven pieces named $A_1$ through $A_7$.

Figure 3.2: An Event B That Intersects With a Partition $A_1, \ldots, A_7$. [The sample space is divided into seven regions labeled A1 through A7; a gray circle labeled Event B overlaps them.]

The probability of $B$ is the proportion of the sample space in the gray circle. But the circle is broken up into seven pieces as well. We can say that $P(B)$ is the sum of the piece of $B$ conditional on being within $A_1$, plus the piece of $B$ conditional on being within $A_2$, and so on. Notice that some conditional probabilities are larger than others. Overall, the gray circle takes up about a third of the sample space. But looking only within set $A_1$, the gray area takes up nearly half of the total area; looking only within set $A_4$, the gray area takes up much less than a third of the total area. Speaking probabilistically, event $B$ is more likely given event $A_1$ than given event $A_4$: $P(B \mid A_1) > P(B \mid A_4)$.
Example 1. Suppose that of the 50 states, 10 are classified as southwestern states, 8 are classified as northwestern, 17 are classified as northeastern, and 15 are classified as southeastern. An army base is to be built in a randomly chosen southern state. What is the probability that the army base is built in a western state?
• We know that the base will be built in a southern state. We want the probability that the state is western. So we are looking for $P(W \mid S)$. The conditional probability definition tells us that
\[ P(W \mid S) = \frac{P(W \cap S)}{P(S)}. \]

• $P(W \cap S)$ is the probability that the state is southwestern, which is $10/50 = .2$.

• We do not know $P(S)$, but we do know that $E$ and $W$ are a partition of the sample space consisting of the 50 states. So
\[ P(S) = P(S \mid W)P(W) + P(S \mid E)P(E) = P(S \cap W) + P(S \cap E) = \frac{10}{50} + \frac{15}{50} = \frac{25}{50} = .5. \]

• So the probability of the event is
\[ P(W \mid S) = \frac{P(W \cap S)}{P(S)} = \frac{.2}{.5} = .4. \]
In this example, another way to think about the problem is that the probability that the state is western given that a southern state will be chosen is the number of southwestern states out of the total number of southern states.
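Both routes to the answer can be checked with a short Python sketch: the conditional probability definition combined with the partition formula, and the direct count of southwestern states among southern states. The data structure and helper function are illustrative, built only from the state counts given in the example.

```python
from fractions import Fraction

# Number of states in each (north/south, east/west) region, from the example.
state_counts = {
    ("south", "west"): 10,
    ("north", "west"): 8,
    ("north", "east"): 17,
    ("south", "east"): 15,
}
total_states = sum(state_counts.values())


def prob(condition):
    """Probability that a uniformly chosen state's region satisfies `condition`."""
    matching = sum(n for region, n in state_counts.items() if condition(region))
    return Fraction(matching, total_states)


p_west = prob(lambda r: r[1] == "west")
p_east = prob(lambda r: r[1] == "east")
p_south_and_west = prob(lambda r: r == ("south", "west"))
p_south_and_east = prob(lambda r: r == ("south", "east"))

# P(S) built from the partition {W, E}: P(S|W)P(W) + P(S|E)P(E).
p_south = (p_south_and_west / p_west) * p_west + (p_south_and_east / p_east) * p_east

# Definition of conditional probability.
p_west_given_south = p_south_and_west / p_south

# Direct count: southwestern states out of all southern states.
p_direct = Fraction(10, 10 + 15)

print(p_south, p_west_given_south, p_direct)
```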
If conditional sets represent outside information that we can use to inform our beliefs about the state of the world, then we can use this information to improve our inferences. This sort of analysis has many uses in the social sciences and underlies most of our quantitative methodology. We can make a better forecast about the possibility of civil war if we condition on information about ethnic fragmentation in the society. We can better predict the electoral performance of a candidate given information about the economy. Any situation in which certain variables help us predict outcomes can be thought of as an exercise in conditional probability.
3.6 Bayes’ Rule

Bayes’ rule is one of the most important (and, at first, confusing) formulas in probability theory. If used correctly, the rule provides us with a means to refine our estimates based on prior theories and a way to update our results given new information. Bayes’ rule is derived quickly from the definition of conditional probability, which tells us that for two events A and B,
\[ P(B|A) = \frac{P(A \cap B)}{P(A)}, \]
\[ P(A \cap B) = P(B|A)P(A). \]
The definition also tells us that if we switch A and B,
\[ P(A|B) = \frac{P(A \cap B)}{P(B)}, \]
\[ P(A \cap B) = P(A|B)P(B). \]
Since P(A ∩ B) is equal to two different expressions, we have the following equation:
\[ P(B|A)P(A) = P(A|B)P(B). \]
Dividing both sides by P(B) gives us the most basic form of Bayes’ rule.

Bayes’ rule, version 1:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)}. \]
There are two other formulations of Bayes’ rule that are useful. Sometimes we don’t know P(B), but we do know some conditional probabilities involving B. From Section 3.5, we know that
\[ P(B) = P(B|A)P(A) + P(B|\tilde{A})P(\tilde{A}). \]
So we can rewrite Bayes’ rule as follows:

Bayes’ rule, version 2:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|\tilde{A})P(\tilde{A})}. \]
We also know from Section 3.5 that given a partition $A_1, A_2, \ldots, A_k$,
\[ P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_k)P(A_k). \]
So we have a third version of Bayes’ rule:
Bayes’ rule, version 3: For some event $A_j$, which is part of a partition $A_1, A_2, \ldots, A_k$,
\[ P(A_j|B) = \frac{P(B|A_j)P(A_j)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + \cdots + P(B|A_k)P(A_k)}. \]
This last form of Bayes’ rule is probably the most useful form of the rule for most problems. The logic of Bayes’ rule is that we must consider the possibility of an outcome in every state of the world to understand which state of the world really exists.

A very famous example of Bayes’ rule relates to medical testing for diseases. There are two states of the world: one in which you have a disease, and one in which you do not. You are tested for the disease and the test comes back positive for the disease. Does that mean you have the disease? Not necessarily. We have to answer the following questions first. How likely is it that the test is positive in the case in which you actually have the disease? This probability is sometimes called the true-positive rate. But we must also consider how likely it is that the test is positive in the case in which you do not have the disease. This probability is sometimes called the false-positive rate.6 Suppose that we have a test that is almost always positive when you have the disease, but is sometimes positive even when you don’t have the disease. Then you cannot be sure you have the disease, even if you’ve tested positive for it. Prior to the test, if you thought that having the disease was unlikely, then you have good reason to be incredulous after the test. The following example puts mathematical language onto this logic.

Example 1. You are administering a test for HIV. In 99% of the cases where HIV is present, the test shows it. However, in 3% of the cases in which HIV is not present, the test will give a positive result for HIV. Suppose prior to the test, the probability that a person has HIV is .05. Given that the person tests positive for HIV, what is the probability that the person actually has HIV?

Let T be the event of a positive test and H be the event that the person has HIV. We are trying to find the probability of HIV given a positive test, which is P(H|T). We do not know P(T), but we do know the conditional probability of a positive test given that a person has HIV, $P(T|H)$, and the conditional probability of a positive test given that a person does not have HIV, $P(T|\tilde{H})$. The second form of Bayes’ rule is helpful here.
\[ P(H|T) = \frac{P(T|H)P(H)}{P(T|H)P(H) + P(T|\tilde{H})P(\tilde{H})} = \frac{.99 \times .05}{(.99 \times .05) + (.03 \times .95)} = .634, \]
which many people find to be a surprisingly low probability given a positive test for HIV.
6 In other applications the true-positive rate is called the sensitivity of a test, and the true-negative rate (the probability that a test is negative when a condition does not exist) is called the specificity of the test. The false-positive rate is the complement of the true-negative rate.
In the above example, we started by stating that prior to the test, the probability that the person has HIV is only .05. This statement is a simple example of a prior distribution. Essentially, a prior is what we believe about the likelihood of possible states of the world before observing a test or acquiring data. How do we know that the prior probability is .05? Here is where Bayes’ rule connects to the philosophical approach to inference called Bayesian statistics, discussed briefly at the beginning of this chapter. Prior probabilities are ways to quantify beliefs. A probability close to 1 is a near certain belief that something is true, close to 0 is a near certain belief that something is false, and probabilities between 0 and 1 represent varying degrees of uncertainty that something could be true. The prior probability of .05 represented the belief, prior to the test, that the person has HIV: the person was slightly concerned but thought that having HIV was unlikely. The probability of HIV after the test, .634, is a simple example of a posterior distribution. We can think of a posterior as the updated beliefs, in light of new information. The calculation of the posterior probability in Example 1 would have been different if the prior had been lower or higher.

Example 2. You are again administering tests for HIV. As before, in 99% of the cases where HIV is present, the test shows it. However, in 3% of the cases in which HIV is not present, the test will give a positive result for HIV. This time, you are testing two different people. The first person has no reason to suspect HIV, so prior to the test, the probability that the first person has HIV is .001. The second person has a specific concern, and the probability that the second person has HIV is .4. Given that both people test positive for HIV, what is the probability that each person actually has HIV?

As in Example 1, we employ the second version of Bayes’ rule and derive the following formula:
\[ P(H|T) = \frac{P(T|H)P(H)}{P(T|H)P(H) + P(T|\tilde{H})P(\tilde{H})}. \]
Since both people take the same test, we plug the same information about the test into the formulas for both people, so $P(T|H) = .99$ and $P(T|\tilde{H}) = .03$. For the first person, $P(H) = .001$ and $P(\tilde{H}) = .999$, so the posterior probability is
\[ P(H|T) = \frac{.99 \times .001}{.99 \times .001 + .03 \times .999} = .032. \]
For the second person, $P(H) = .4$ and $P(\tilde{H}) = .6$, so the posterior probability is
\[ P(H|T) = \frac{.99 \times .4}{.99 \times .4 + .03 \times .6} = .957. \]
These two people differ only in their prior beliefs about having HIV. They take the exact same test and find the exact same result. But because we believed that HIV is so unlikely for the first person, we still believe that it is very unlikely even after a positive test (although, an increase in probability from .001 to .032 is quite substantial). For the second person, because we had a significant concern about HIV before the test, the positive result makes us much more certain that HIV is really present.
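To see how the prior drives the result, the second version of Bayes’ rule can be wrapped in a small Python function. This is only a sketch: the function and argument names are hypothetical, and the test characteristics (.99 and .03) and the three priors (.05, .001, and .4) are the ones used in Examples 1 and 2.

```python
def posterior(prior, true_positive_rate, false_positive_rate):
    """Return P(H|T) using Bayes' rule, version 2.

    prior               -- P(H), the belief before the test
    true_positive_rate  -- P(T|H)
    false_positive_rate -- P(T|not H)
    """
    numerator = true_positive_rate * prior
    # P(T) by the law of total probability over H and its complement.
    denominator = numerator + false_positive_rate * (1 - prior)
    return numerator / denominator


# Same test for everyone; only the prior changes.
for prior in (0.05, 0.001, 0.4):
    print(f"prior = {prior}: posterior = {posterior(prior, 0.99, 0.03):.4f}")
```

The three posteriors come out to roughly .635, .032, and .957. The first value is reported as .634 in the text, which corresponds to dropping rather than rounding the later digits of .6346.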
The logic of Bayes’ rule can be applied to nearly any situation that involves uncertainty and beliefs about likely states of the world. In my apartment, for example, there is an obnoxious
smoke detector. To be sure, if there is a fire it will go off. But it often goes off when there is no fire. If I wake up in the middle of the night and hear the smoke detector, I am not inclined to believe that there really is a fire. But then, if I remember that I accidentally left the stove on, my prior belief in a fire increases, and I am much more likely to believe that the smoke alarm indicates a real fire. Likewise, in the social sciences, we can apply Bayes’ rule to just about any topic.

Example 3. You are researching government corruption. For this exercise, suppose that a government is corrupt (C ), somewhat corrupt (C ), or not corrupt (C ), and that your prior beliefs for any government are such that P(C ) = . , P(C ) = . , and P(C ) = . . All governments waste some amount of public resources, but corrupt governments usually waste much more than less corrupt governments. You have measured waste W on a 0–10 scale, and you have determined that corrupt and noncorrupt governments are likely to exhibit each level of waste with the following probabilities:

[Table: the conditional probability of each level of waste, W = 0 through 10, for each of the three categories of corruption.]
What is the probability that a government is corrupt, C , given that its level of waste is W = ? We want to find P(C |W = ). We can use the third version of Bayes’ rule since C , C , and C form a partition of the possible levels of government corruption. This formula is P(C |W = ) =
P(W = |C )P(C ) . P(W = |C )P(C ) + P(W = |C )P(C ) + P(W = |C )P(C )
Plugging in the appropriate values, we find that P(C |W = ) =
. ×. (. × . ) + (. × . ) + (.
×. )
=. ,
which is nearly a threefold increase over our prior belief that the government was corrupt. We can also update our belief about the government being somewhat corrupt: P(C |W = ) =
. ×. (. × . ) + (. × . ) + (.
×. )
=.
,
which is slightly less than our prior belief. The probability that the government is noncorrupt, however, shrinks to a very small probability: P(C |W = ) = − P(C |W = ) − P(C |W = ) = .
.
To summarize, we observed a high level of government waste and updated our beliefs accordingly: the government in question is now more likely to be corrupt.
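The third version of Bayes’ rule can be sketched as one Python function that updates every state in a partition at once. The function name and the dictionary layout are assumptions made for illustration, and the numbers below are hypothetical: they are not the priors or table values from Example 3.

```python
def posterior_over_partition(priors, likelihoods):
    """Return P(A_j|B) for every A_j in a partition (Bayes' rule, version 3).

    priors      -- dict mapping each state A_j to P(A_j); must sum to 1
    likelihoods -- dict mapping each state A_j to P(B|A_j)
    """
    if abs(sum(priors.values()) - 1) > 1e-9:
        raise ValueError("priors must sum to 1 over the partition")
    # Numerators P(B|A_j)P(A_j), one per state.
    joint = {state: likelihoods[state] * priors[state] for state in priors}
    # Denominator P(B), by the law of total probability.
    evidence = sum(joint.values())
    return {state: joint[state] / evidence for state in joint}


# Hypothetical numbers for illustration only (NOT the values in Example 3).
priors = {"corrupt": 0.2, "somewhat corrupt": 0.3, "not corrupt": 0.5}
likelihoods = {"corrupt": 0.30, "somewhat corrupt": 0.10, "not corrupt": 0.02}

posteriors = posterior_over_partition(priors, likelihoods)
for state, value in posteriors.items():
    print(f"{state}: {value:.3f}")
```

Because every posterior shares the same denominator, the posteriors always sum to 1. That is why the last probability in Example 3 can be found by subtracting the other two from 1.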
Exercises

1. There are currently 233 Republicans and 200 Democrats in the House of Representatives, and there are currently 46 Republicans, 52 Democrats, and 2 Independents in the Senate.
(a) What is the probability that a randomly drawn member of Congress is a Republican or a Senator?
(b) The House Rules committee consists of 13 House members. Ignoring rank and chairmanship, how many possible ways are there to form this committee? (Hint: It is a very large number. Approximate values are okay as long as it is clear how you derived that value.)
2. State whether the two events are more likely to be independent, conditionally independent, or dependent. Explain your answer. If the two events are conditionally independent, state the event that they both depend on.
(a) A high school student’s parents belong to a country club; the high school student gets a really good score on the SAT
(b) A country has low levels of spending on schools and libraries; the country has a high rate of illiteracy
(c) Two mutually exclusive events
3. Provide an example of two events that are clearly independent.
4. If n is a positive integer and k is a positive integer less than or equal to n, demonstrate that
\[ \binom{n}{k} = \binom{n}{n-k}. \]
5. We are going to solve the following classic problem: Ignoring leap years, what is the probability that at least 2 people in a class of 30 share a birthday? But we are going to solve it in steps. Approximate answers are okay for very large values.
(a) In words, describe the complement of the event under consideration. Also in words, describe the sample space.
(b) How many outcomes exist that satisfy the complement event?
(c) How many outcomes are in the sample space?
(d) Now solve the problem: Ignoring leap years, what is the probability that at least 2 people in a class of 30 share a birthday?
6. How many government officials in a country have taken bribes? You can imagine that it would be very difficult to obtain an answer to this question using surveys. If directly asked, “Have you ever taken a bribe?,” people who have taken one will likely lie and say that they never have. However, there is a strategy to accurately measure the number of people who have a sensitive trait, such as having taken a bribe, in a population. This strategy employs the rules of probability discussed in this chapter. Instead of being asked directly whether or not they have taken a bribe, respondents are shown two statements:7
A: I have at some point taken a bribe.
B: I have never taken a bribe.
The respondents are also given a spinner. It is a circle with a movable arrow that spins quickly when flicked. Some region of the circle is labeled “A,” and the other part is labeled “B.” The respondents are asked to first spin the spinner. The resultant letter is private information: Because the respondents hold the spinner in their hand and spin it out of the view of the interviewer, they can be confident that only they know the result of the spin. The interviewers do, however, know how big the regions for each letter are, and therefore, the probability of each letter is known. The spinner illustrated below is constructed so that P(A) = . and P(B) = . .
The respondents are then asked to respond either TRUE or FALSE to the statement next to the letter they’ve spun. For example, if an official who has taken a bribe spins an A, he should respond TRUE. But since no one but him knows if he spun an A or a B, he can claim that he responded TRUE to having never taken a bribe. There are four possible cases for a respondent:

If the respondent has    And has spun    Then he or she should respond
Taken a bribe            A               TRUE
Taken a bribe            B               FALSE
Never taken a bribe      A               FALSE
Never taken a bribe      B               TRUE

7 This question is based on the methodology employed by Gingerich, D. W. (2013). Political institutions and party-directed corruption in South America: Stealing for the team. New York, NY: Cambridge University Press.
Since we cannot tell for any individual whether he or she is responding to Statement A or B, it may seem that this question doesn’t help us. But it is possible to calculate from this information the proportion of the population that has taken a bribe. We will work through this calculation together. Let X be the event in which a respondent says TRUE. Let W be the event in which the respondent has taken a bribe. And let Y be the event that a respondent spins A.
(a) Using the set notation language of unions, intersections, and complements, express X in terms of W and Y.
(b) Are events W and Y independent? Are events X and Y independent? Why or why not?
(c) Using your answer in (a) and (b), demonstrate that
\[ P(X) = P(W)P(Y) + P(\tilde{W})P(\tilde{Y}). \]
(d) Solve the equation in (c) for P(W). This quantity expresses the proportion of government officials who have taken a bribe. [Hint: replace $P(\tilde{W})$ with $1 - P(W)$.]
(e) Why can’t the spinner have equally sized areas for A and B?
(f) Suppose that 35% of the respondents select TRUE for this question. Given that their spinners have a .75 probability of landing on A, what proportion of government officials have taken a bribe?
7. I’ve got a headache. It happens. In fact it happens to me about 1 out of every 10 days. Still, I’m worried because a headache is a common symptom of an oncoming flu: Half of all flu sufferers report headaches. The Centers for Disease Control reports that about 2% of Americans will contract the flu this year.8 So, tell me straight, how likely is it that I’m coming down with the flu?

8 See http://www.cdc.gov/flu/weekly/overview.htm.

8. Dear beloved friend, I know this message will come to you as surprise but permit me of my desire to go into business relationship with you. I am Miss Pupet Andre. a daughter
to late Justice.y.Andre of Sudan who died on a plane crash in May 2 2008 in Sudan, my late father was the presidential adviser for local government affairs in Sudan-Khartoum. Meanwhile before the incident, my late Father came to Burkina Faso with the sum of USD5,200,000.00 (US$5.2M) which he deposited in a Financial and Security Institution here in Burkina Faso Republic West Africa for security reasons. I am here seeking for an avenue to transfer this money to a reliable and trustworthy person for Investment purpose. These particular sorts of scams account for a great deal of Internet fraud. According to an article in Mother Jones,9 In 2011, the FBI received close to 30,000 reports of advance fee ploys, called “419 scams” after the section of the Nigerian criminal code that outlaws fraud. The agency received over 4,000 complaints of advance fee romance scams in 2012, with victim losses totaling over $55 million. Nigerians aren’t the only ones committing international advance fee fraud, but nearly one-fifth of all such scams originate in the West African country. Google and other e-mail clients use Bayes’ rule to sort e-mail messages like this one into your spam folder by looking for particular words and combinations of words.10 The downside is that sometimes perfectly legitimate e-mails get sorted into the spam folder because they contain these words as well. (a) Suppose that 5% of all spam messages are the 419 scams that contain the word Nigeria, that 35% of all e-mail messages are spam, and that only .1% of your legitimate e-mails contain the word Nigeria. What is the probability that an e-mail with the word Nigeria is spam? (Be careful; these percentages imply probabilities of .05, .35, and .001, respectively.) (b) One of your best friends from college now lives in Nigeria. You wait patiently for an e-mail from her, anxious to hear how she’s doing, but you hear nothing. 
Finally, you check your spam folder and you are dismayed to see several e-mails from her over the past few months. On Gmail and other e-mail clients, you have the option to designate filtered messages as nonspam. Supposing that Google will sort an e-mail into the spam folder if there is a .95 probability or higher that the e-mail is spam, what percentage of your nonspam e-mails must contain the word Nigeria for these e-mails to stop being filtered? (c) The New Yorker asks a fascinating question:11 Many of the scammers . . . are indeed from Nigeria. But they lie about lots of things (like “we have a large sum of gold”). Why not also pretend to
9 Eichelberger, E. (2014, March 20). What I learned hanging out with Nigerian e-mail scammers. Mother Jones. Retrieved from http://www.motherjones.com/politics/2014/03/what-i-learned-from-nigerian-scammers.
10 See http://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering for more detail.
11 Thompson, N. (2012, June 22). The Golden Rule: A new way to defeat Nigerian e-mail scams. The New Yorker. Retrieved from http://www.newyorker.com/culture/culture-desk/the-golden-rule-a-new-way-to-defeat-nigerian-email-scams.
be from a place that is not known for e-mail scams? Why don’t they make the whole scheme seem a little less ridiculous at the outset? Why don’t the scammers say they are from some place more plausible, like “New Jersey”? One answer may be that the scammers want to maximize the probability that the people who do respond are actually gullible enough to send money. Let’s demonstrate this logic using Bayes’ rule. Let R represent the event in which a person responds to the 419-scam e-mail, and let G represent the event in which the person is gullible. Assume that 5% of the population is gullible. Let’s consider two versions of a 419-scam e-mail. The first version is quite well written and states that the author is from Alexandria, Virginia. While it won’t convince the majority of its authenticity, it does convince 40% of gullible people and 20% of nongullible people to respond. The second version is filled with grammatical and syntax errors and states that the author is from Lagos, Nigeria. This second letter only convinces 10% of gullible people and only 1 in every 1,000 nongullible persons to respond. For each letter, what is the probability that someone who responds is gullible? 9. One important application of Bayes’ rule is to a topic in game theory called Bayesian games. In these games, an actor must choose an action while being uncertain about the nature of the game and of the preferences of the other actors. Bayesian games have been used to model auctions and bargaining (How much do I bid on a car if I’m not sure whether the car works well?) and negotiation (How much should one country concede in peace negotiations if there is uncertainty about the sincerity or capacity of the other country?). These games depend on the use of signals to reduce uncertainty. For example, a ceasefire may serve as a credible signal of a country’s sincerity, which might influence the other country’s actions so that both countries achieve higher payoffs. 
These games are called Bayesian because the actors approach the game with prior beliefs about the state of the world; then they observe a signal, and they update their beliefs about the state of the world. They then use these updated (posterior) beliefs to choose their actions. A classic example of a Bayesian game involves members of a jury.12 A juror in a criminal trial must choose whether to vote to acquit or convict the defendant. The juror has the following payoffs for each vote:
• U = 0 if a guilty defendant is convicted or an innocent defendant is acquitted,
• U = −z if an innocent defendant is convicted, and
• U = −(1 − z) if a guilty defendant is acquitted,
where z ∈ (0, 1). The highest payoff the juror can receive is 0, which he or she gets by voting correctly to convict a guilty defendant or to acquit an innocent defendant. Since z is positive, the juror’s two payoffs for voting incorrectly are lower, but the juror does not consider an acquittal of a guilty person to be equally bad as convicting an innocent person. The value of z itself can be thought of as a ratio of the juror’s preferences for each kind of wrong decision. For example, if z = .25, then the payoff for convicting an innocent person is −.25 and the payoff for releasing a guilty person is −.75, so the juror considers a false acquittal to be three times worse than a false conviction.

The juror is uncertain about the guilt or innocence of the defendant but has prior biases. Let G represent the event that the defendant is actually guilty. Before hearing the arguments of the prosecution and the defense, the juror believes13 the defendant is guilty with probability P(G) = π.
(a) One way that the juror can make a decision is by calculating an expected utility (EU) for each action. An expected utility is the sum, over the states of the world, of the payoff for an action in each state times the probability of being in that state. Before the trial begins, the expected utility for the juror of voting to acquit is given by
\[ EU(\text{acquit}) = U(\text{acquit an innocent person})P(\tilde{G}) + U(\text{acquit a guilty person})P(G), \]
and the expected utility of voting to convict is given by
\[ EU(\text{convict}) = U(\text{convict an innocent person})P(\tilde{G}) + U(\text{convict a guilty person})P(G). \]
The juror will vote to convict only if EU(convict) > EU(acquit). Derive the mathematical condition that must be met for a juror to vote to convict, and describe in substantive terms what this condition means.
(b) Suppose that the defense provides an argument that is superior to the argument advanced by the prosecution, and denote this event as D. This event acts as a signal to the juror about the true state of the world and causes the juror to update his or her beliefs about the guilt of the defendant. Suppose that the probability that the defense presents the stronger case when the defendant is innocent is given by
\[ P(D|\tilde{G}) = p \]
and that the probability that the defense presents the stronger case when the defendant is guilty14 is
\[ P(D|G) = q. \]
Find P(G|D).
(c) Using your answer from (b), calculate P(G|D) for the following special cases:
• π=
• p=q
• p = and q =

12 This particular example is adapted from the illustration provided in Osborne, M. J. (2004). An introduction to game theory (pp. 301–305). New York, NY: Oxford University Press.
13 In this problem, it makes sense to think of probabilities in the Bayesian way, as measures of belief. The alternative, the frequentist way, is pretty absurd and reads something like this: “Out of infinitely many realities, the proportion of realities in which the defendant is guilty is π.”
14 Note that these two events aren’t complements. The complement event to “The defense makes a better argument when the defendant is innocent” is “The defense does not make a better argument when the defendant is innocent,” not “The defense makes a better argument when the defendant is guilty.” In mathematical terms, the complement to $P(D|\tilde{G})$ is $P(\tilde{D}|\tilde{G})$, not $P(D|G)$.
For each case, consider how the juror’s posterior belief in the guilt of the defendant, P(G|D), compares with the juror’s prior belief, P(G). Describe whether or not the signal provided by the trial matters, and if so, describe how strong the signal is. 10. From the New York Times, Saturday, September 14, 2013:15 U.S. and Russia Reach Deal to Destroy Syria’s Chemical Arms By MICHAEL R. GORDON GENEVA—The United States and Russia reached a sweeping agreement on Saturday that called for Syria’s arsenal of chemical weapons to be removed or destroyed by the middle of 2014 and indefinitely stalled the prospect of American airstrikes. The joint announcement, on the third day of intensive talks in Geneva, also set the stage for one of the most challenging undertakings in the history of arms control. “This situation has no precedent,” said Amy E. Smithson, an expert on chemical weapons at the James Martin Center for Nonproliferation Studies. “They are cramming what would probably be five or six years’ worth of work into a period of several months, and they are undertaking this in an extremely difficult security environment due to the ongoing civil war.” Although the agreement explicitly includes the United Nations Security Council for the first time in determining possible international action in Syria, Russia has maintained its opposition to any military action. But George Little, the Pentagon press secretary, emphasized that the possibility of unilateral American military force was still on the table. “We haven’t made any changes to our force posture to this point,” Mr. Little said. 
“The credible threat of military force has been key to driving diplomatic progress, and it’s important that the Assad regime lives up to its obligations under the framework agreement.” In Syria, the state news agency, SANA, voiced cautious approval of the Russian and American deal, calling it “a starting point,” though the government issued no immediate statement about its willingness to implement the agreement. The Syrian government itself was responsible for destroying its arsenal of chemical weapons, and as the last paragraph of the above passage indicates, it was unclear at the time whether the Syrian government would be compliant in carrying out this task. Even if the Syrian government complied with this agreement, the conditions within Syria at the time due to the civil war may have made it impossible to remove weapons from some sites. On the other hand, a noncompliant government may have still cleared some sites in an attempt to appease Russia and the United States while maintaining some of its chemical arsenal.
15 See http://www.nytimes.com/2013/09/15/world/middleeast/syria-talks.html.
Suppose that the Syrian government will be either compliant (C) or noncompliant ($\tilde{C}$). Suppose further that there are 10 sites where the Syrian government stores chemical weapons. If the Syrian government is compliant, then it will be able to clear a site with a probability of .95. If the government is noncompliant, it will clear a site with a probability of .6. Assume further that the probability of clearing one site is independent from the probability of clearing any other site. Let our prior estimate of the probability that the Syrian government is compliant be P(C) = . . Suppose that when the deadline comes around, the government is able to clear 7 sites and claims that clearing the other 3 sites proved too difficult.
(a) What is the probability that a compliant government would have cleared 7 of the 10 sites? What is the probability that a noncompliant government would have cleared 7 out of the 10 sites?
(b) What is the probability that the government was actually compliant?
11. Popular Science describes a mathematical procedure in which social scientists were able to measure the exact ideological position of Barack Obama and Mitt Romney during every day of their 2012 presidential election campaigns, using the text of the speeches each candidate gave.16 While the magazine refers to this approach as “artificial intelligence,” the procedure is just a really cool use of Bayes’ rule. This exercise is a simplification of that procedure. To measure ideology from speeches, we first have to consider two questions:
• What are the ideologies we are trying to measure?
• What does a speech that invokes a particular ideology look like?
To avoid oversimplifying the concept of ideology, we consider six categories of ideology: far left, economic left, center left, center right, libertarian right, and far right (denoted as events FL, EL, CL, CR, LR, and FR, respectively). For each category, we use a book or a periodical as an exemplar of that particular ideology.
For example, we scan every page of Glenn Beck’s Common Sense: The Case Against an Out-of-Control Government, Inspired by Thomas Paine into a computer with software to read the individual words and store them in a separate data set. The program can then count the number of occurrences of each word. These books are called corpus texts, and we will compare the frequency of use of particular words in the candidates’ speeches with the use of these words in the corpus texts. In the table, the counts in each cell are the number of uses of each word per 10,000 words. Because Noam Chomsky uses the word economy 10 times out of every 10,000 words, there is a probability of .001 that a randomly drawn word from Chomsky’s book is economy.

16 Sofge, E. (2014, September 24). The machines vs. Mitt Romney: How artificial intelligence is parsing political rhetoric. Popular Science. Retrieved from http://www.popsci.com/blog-network/zero-moment/machines-vs-mitt-romney-how-artificial-intelligence-parsing-political. See also Gross, J. H., Acree, B., Sim, Y., & Smith, N. A. (2013). Testing the etch-a-sketch hypothesis: Measuring ideological signaling via candidates’ use of key phrases. Presented at the 2013 annual meeting of the American Political Science Association. Retrieved from http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2299991.
Ideology
Far left
Corpus Use of Use of Use of Use of Use of Use of Text the word the word the word the word the word the word economy justice moral racism income tax terrorism Noam Chomsky
30
3
50
15
10
10
New Republic
20
30
40
David Brooks
30
10
20
Libertarian right
Rand Paul
20
20
10
Far right
Glenn Beck
20
80
30
Economic left Robert Reich Center left Center right
Suppose on a particular day during the campaign, our prior beliefs about Barack Obama’s ideology are as follows: • P(FL) = . • P(EL) = . • P(CL) = . • P(CR) = . • P(LR) = . • P(FR) = . Admittedly, Obama doesn’t simply have one of these ideologies, but by treating the six ideologies as a partition of the sample space, we can think of these probabilities as representations of the mix of ideologies that influence Obama. If a randomly drawn word from one of Obama’s speeches is terrorism, how do we update our beliefs about Obama’s ideology? (Hint: You will need to compute Bayes’ rule once for every ideological category.)
PART II Calculus
In the fall of 1972 President Nixon announced that the rate of increase of inflation was decreasing. This was the first time a sitting president used the third derivative to advance his case for reelection. —Hugo Rossi
4 Limits and Derivatives
4.1 What Is a Limit?

Functions are useful to us because they provide one, unambiguous answer for the information we input into the function. But there are situations in which a function cannot provide an output for the input we are interested in. Sometimes the output is simply beyond our observation. Other times, we are more interested in prediction and long-term trends. In these situations, we are interested in where a function is going, not just where it is. A limit describes where a function appears to be headed. If we make the statement f(a) = L, we are saying that the value of the function is L precisely when x = a. If instead we make the statement
\[ \lim_{x \to a} f(x) = L, \]
we are using limit notation to describe where the function appears to be heading instead of where it is. The components of this notation translate as follows:

Notation                      Translation
$x \to a$                     “x approaches a”
$\lim f(x) = L$               “The function f approaches the value L”
$\lim_{x \to a} f(x) = L$     “The function f approaches a value of L as its input x approaches a”

This statement is not the same as f(a) = L. It is possible for a function to approach a value without actually achieving it. Consider, for example, the following function:
\[ f(x) = \frac{x^2 - 25}{x - 5}. \]
This function has no output for x = 5, since this value of x would make the denominator of the function equal to zero. But what does the function look like if we approach 5? If we choose values of x that are less than 5 but get closer and closer to 5, then f(x) appears to approach 10:
[Table: values of x approaching 5 from the left and the corresponding values of f(x)]
Since we’ve chosen values of x less than 5, we say that x → 5 from the left, since the values of x on a graph are smaller on the left. The left limit of the function as x → 5 is 10. We can denote this limit as
$$\lim_{x \to 5^-} \frac{x^2 - 25}{x - 5} = 10,$$
where the negative sign next to the 5 denotes that the limit is a left limit. If we approach 5 by choosing values of x that are greater than 5, we see the function behaving similarly:
[Table: values of x approaching 5 from the right and the corresponding values of f(x)]
The right limit of the function as x → 5 is also 10. We denote this limit as
$$\lim_{x \to 5^+} \frac{x^2 - 25}{x - 5} = 10,$$
where the positive sign next to the 5 denotes that the limit is a right limit.
Existence of a limit: The limit of a function exists only if the left and right limits both exist and are equal:
$$\lim_{x \to a^+} f(x) = \lim_{x \to a^-} f(x).$$
So in the above example, since the left and right limits are equal, we can say that
$$\lim_{x \to 5} \frac{x^2 - 25}{x - 5} = 10.$$
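The brute-force approach described above, plugging in inputs that creep toward 5 from each side, can be sketched in a few lines of Python. This is only an illustration: it assumes the function under discussion is f(x) = (x² − 25)/(x − 5), and the helper name `one_sided_values` and the step sizes (powers of 1/10) are choices made here, not taken from the text.

```python
def f(x):
    """f(x) = (x^2 - 25) / (x - 5); not defined at x = 5."""
    return (x**2 - 25) / (x - 5)


def one_sided_values(a, side, steps=5):
    """Return (x, f(x)) pairs with x moving toward a.

    side = -1 approaches from the left, side = +1 from the right.
    The k-th input sits at a distance of 10**(-k) from a.
    """
    pairs = []
    for k in range(1, steps + 1):
        x = a + side * 10.0 ** (-k)
        pairs.append((x, f(x)))
    return pairs


left = one_sided_values(5, -1)
right = one_sided_values(5, +1)

print("Approaching 5 from the left:")
for x, y in left:
    print(f"  x = {x:.5f}   f(x) = {y:.5f}")

print("Approaching 5 from the right:")
for x, y in right:
    print(f"  x = {x:.5f}   f(x) = {y:.5f}")
```

Both columns of outputs settle toward the same number, which is the numerical signature of a limit that exists even though the function has no value at the point itself.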
if y ≤ , Example 1. Let h(y) = if y > . We can show that the limit of h(y) as y → does not exist. The right limit is lim h(y) = ,
y→ +
and the left limit is lim h(y) = .
y→ −
Here, we observe that lim h(y) ̸= lim h(y).
y→ +
y→ −
Therefore, limy→ h(y) does not exist. The graph demonstrates the gap in this function to the left and to the right of y = .
The point y = in Example 1 is called a discontinuity. A formal definition of continuity is provided in Section 4.2, but briefly, a function is continuous if its graph can be drawn without lifting the pencil off the paper. The graph in Example 1 is discontinuous because the segment to the left of y = does not intersect the segment to the right of y = .
The formal definition of a limit is very technical. I present it here, but the English interpretation of this definition is more important.
Definition: Limit
$\lim_{x \to a} f(x) = L$ if and only if for all $\epsilon > 0$ there exists a $\delta > 0$ such that if $0 < |x - a| < \delta$, then $|f(x) - L| < \epsilon$.
Here the limit approaches a value L as x approaches a value a. The symbols ϵ and δ represent very small numbers. The definition says that we can always choose a small enough
value of δ, the distance between x and a, so that the difference between f(x) and L is less than ϵ, no matter how small it is. In other words, by pushing x closer and closer to a, we can make the value of the function get as close as we want to L.
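The definition can be read as a challenge and response: someone names a tolerance ϵ, and we must answer with a distance δ that works. The sketch below plays that game for one illustrative case, assuming f(x) = (x² − 25)/(x − 5), a = 5, and L = 10. For this particular function f(x) equals x + 5 whenever x ≠ 5, so |f(x) − 10| = |x − 5| and the response δ = ϵ is enough. The function names and the sampled points are assumptions made for this example, and sampling a handful of points illustrates the definition rather than proving it.

```python
def f(x):
    """Illustrative function: equals x + 5 everywhere except x = 5."""
    return (x**2 - 25) / (x - 5)


def delta_for(eps):
    # For this f, |f(x) - 10| = |x - 5| when x != 5, so delta = eps suffices.
    return eps


def check(eps, a=5.0, L=10.0, samples=9):
    """Sample points with 0 < |x - a| < delta and test |f(x) - L| < eps."""
    delta = delta_for(eps)
    for i in range(1, samples + 1):
        offset = delta * i / (samples + 1)  # strictly between 0 and delta
        for x in (a - offset, a + offset):
            if not abs(f(x) - L) < eps:
                return False
    return True


results = {eps: check(eps) for eps in (1.0, 0.1, 0.01, 0.001)}
for eps, ok in results.items():
    print(f"eps = {eps:<6} delta = {delta_for(eps):<6} all sampled points within eps: {ok}")
```

Other functions need a different rule for choosing δ; the point of the sketch is only that a workable δ can be produced for every ϵ, no matter how small.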
Figure 4.1 Closing In on a Value Using the Definition of a Limit
In more straightforward English, $\lim_{x \to a} f(x) = L$ means that if we close in on a value a on the x-axis, we also close in on the value L on the y-axis. Figure 4.1 demonstrates that in the example of the function
$$f(x) = \frac{x^2 - 25}{x - 5},$$
the closer x gets to 5, the closer f(x) gets to 10. We can make f(x) get as close as we want to 10 by choosing an x that is close enough to 5. In this example, 5 is a, 10 is L. If x is pushed infinitesimally close to 5 (separated by a trivial distance δ), then f(x) is infinitesimally close to 10 (separated by a trivial distance ϵ).
Limits are also good ways to think about infinite values. Recall from Section 2.2 that the number infinity, denoted ∞, represents a quantity that is larger than every real number, although it is not itself considered a real number. Likewise, the number −∞ is smaller than every real number. Since ∞ and −∞ are not real numbers, we cannot simply plug these numbers into functions. Instead, we can use limits to assess these values.
Example 2. Assess $\lim_{x \to \infty} \frac{1}{x}$.
Assessing the value of a particular limit is the same thing as solving the limit. There are some useful techniques for solving limits presented in Section 4.3, but the most basic and brute-force method is to plug in values closer and closer to the appropriate value and observe the pattern for the values of the function. When we are considering x → ∞, we plug in larger and larger values of x.
[Table: increasingly large values of x and the corresponding values of 1/x]
It appears that as x → ∞, the value of the function approaches 0. Indeed, the solution of the limit is
$$\lim_{x \to \infty} \frac{1}{x} = 0.$$
In the social sciences, ∞ is a useful way to represent the long-term behavior of some process. Example 3. Suppose that we model the level of government spending S in a country (in thousands of dollars) as a function of its population p (in thousands of people) and we find that spending is determined by the following function: S(p) =
p − p+
.
Suppose one program that contributes to spending is a social welfare program whose cost W also depends on the population of the country: W(p) = p + p + . Given that the population of the country is growing, what proportion of the spending in the country will eventually be devoted to the social welfare program?
Since the population is growing and the problem considers spending in the long term, we can consider the following limit: W(p) p + p+ lim = lim . p→∞ S(p) p→∞ p − p+ Again, there are more elegant ways to solve this limit, but for now we can plug larger and larger values of p into the ratio: p
W(p)/S(p) .
. . . ,
.
So although the program accounts for large proportions of spending for small populations, this program takes up about 16% of spending as the population becomes large.
In addition, we can use limits to describe places where a function takes on an infinite value for a finite input. Example 4. Solve lim
x→
(x − )
.
Another simple way to solve a limit is to consider the graph of the function:
At x = , the function is increasing toward ∞ on both the left and the right. Since the left and right limits both approach ∞, the limit exists, and lim
x→
(x − )
= ∞.
4.2 Continuity and Asymptotes
Continuity has a clear-cut informal definition: A function is continuous if the graph can be drawn without lifting the pencil from the paper. It is also important to know the formal definition of continuity.
Definition: Continuity at a point
A function is continuous at point a if and only if
$$\lim_{x \to a} f(x) = f(a).$$
So in Section 4.1, the function in Example 1 is discontinuous at the point y = because h( ) = , but the limit there does not exist. A function is said to be continuous if it is continuous at every point in its domain. Therefore, since the function in Example 1 in Section 4.1 has a discontinuity, it is not a continuous function.1 Another way to think about continuity is that both the function and the limit must exist at every point. Values of ∞ are acceptable for limits but not for real number valued functions. For example, consider again the function f(x) =
(x − )
,
which we considered in Example 4 in Section 4.1. The limit as x →
exists because
= ∞ and lim = ∞. x→ − x x But since f( ) is undefined (we cannot say f( ) = ∞ since we are working with real numbers), the function is discontinuous at x = . Functions that have discontinuities where the left and right limits are equal to either ∞ or −∞ are said to have vertical asymptotes at the discontinuous points. That is, a function f(x) has a vertical asymptote at a point a if any of the following statements is true: lim
x→ +
1. $\lim_{x \to a} f(x) = \pm\infty$.
2. $\lim_{x \to a^-} f(x) = \pm\infty$.
3. $\lim_{x \to a^+} f(x) = \pm\infty$.
1 This function, however, is said to be piecewise continuous since it is continuous within discrete regions in its domain.
A vertical asymptote can look like a sinkhole in a graph going down or going up, as shown in Figure 4.2. The graph can also approach ∞ on one side of the asymptote and −∞ on the other side. Figure 4.2
The Graph of f(x) =
(x− )
That Has a Vertical Asymptote at x =
Note that the limit does not necessarily have to exist at a point a for a function to have a vertical asymptote at point a. All that is required is that either the left or the right limit diverge to positive or negative infinity. An example of a function that has a vertical asymptote at a point where the limit does not exist is f(x) = 1/x at the point x = 0. For this function,
$$\lim_{x \to 0^+} \frac{1}{x} = \infty \quad \text{and} \quad \lim_{x \to 0^-} \frac{1}{x} = -\infty.$$
Therefore, f(x) has no limit at x = 0, but it has a vertical asymptote at x = 0.
A function has a horizontal asymptote if it converges to a particular value of f(x) for very large positive or negative values of x. Formally, a function has a horizontal asymptote of height b if
$$\lim_{x \to \infty} f(x) = b \quad \text{or} \quad \lim_{x \to -\infty} f(x) = b.$$
At the edges of the graph, for extreme values of x in either direction, the height of the graph converges to the horizontal asymptote, as shown in Figure 4.3. In Example 3 in Section 4.1, the ratio of the functions S(p) and W(p) has a horizontal asymptote since the ratio approaches 0.16 as p → ∞.
Figure 4.3 The Graph of a Function f(x) That Has a Horizontal Asymptote of Height 0
4.3 Solving Limits
There are four properties that are useful for solving limits. Each assumes that the limits on the right-hand side exist.
1. The limit of the sum (or difference) of functions is equal to the sum (or difference) of the limits of each function:
$$\lim_{x \to a} [f(x) + g(x)] = \lim_{x \to a} f(x) + \lim_{x \to a} g(x),$$
$$\lim_{x \to a} [f(x) - g(x)] = \lim_{x \to a} f(x) - \lim_{x \to a} g(x).$$
2. The limit of a product (or quotient) of functions is equal to the product (or quotient) of the limits of each function:
$$\lim_{x \to a} [f(x) \times g(x)] = \lim_{x \to a} f(x) \times \lim_{x \to a} g(x),$$
$$\lim_{x \to a} [f(x)/g(x)] = \lim_{x \to a} f(x) \Big/ \lim_{x \to a} g(x), \quad \text{provided } \lim_{x \to a} g(x) \neq 0.$$
3. The limit of a constant k is simply the constant itself:
$$\lim_{x \to a} k = k.$$
4. The limit of the product of a constant and a function is the product of the constant and the limit of the function:
$$\lim_{x \to a} [k f(x)] = k \lim_{x \to a} f(x).$$
In practice, though, it is not usually necessary to think about these properties. Most limits are fairly easy to solve, especially when a function is continuous at a specified point. If asked to evaluate the limit of a function at a point where the function is continuous, simply plug in the point. Example 1. Solve lim ( x − )( x + ).
x→
We will use the properties of limits to derive the answer: lim ( x − )( x + ) = lim ( x − ) lim ( x + )
x→
x→
= =
[(
x→
][( ] ) ) lim x − lim lim x + lim
x→
[ (
x→
x→
][ ( ] ) ) lim x − lim x +
x→
x→
x→
= [ ( ) − ][ ( ) + ] =
× =
.
This logic is precisely the same as plugging x = directly into ( x − )( x + ).
Plugging in is always the first thing to try, but sometimes plugging in won’t work, such as in Example 4 in Section 4.1, because the function is discontinuous at the point in question. Another situation in which plugging in won’t work is when x → ∞, that is, when x becomes unboundedly large. To solve these infinite limits, it is usually helpful to use an algebra trick. One trick is to try to simplify an algebraic expression. Example 2. Let f(x) = x + x + . Solve the following limit:
lim
a→x
f(x) − f(a) (x + x + ) − (a + a + ) = lim . a→x x−a x−a
We begin by simplifying this expression as much as possible. First we distribute the negative sign, x + x+ −a − a− lim , a→x x−a we cancel the 7s, x + x−a − a lim , a→x x−a and we combine the like terms, lim
a→x
(x − a ) + (x − a) . x−a
Recall that the difference of squares (x − a ) simplifies to (x − a)(x + a): lim
a→x
(x − a)(x + a) + (x − a) . x−a
Next we factor the (x − a) term out of the numerator, (x − a)(x + a + ) , x−a
lim
a→x
and we cancel the (x − a) terms in the top and bottom, leaving lim x + a + .
a→x
Now it is easy to solve the limit by plugging x in for a: x+ . lima→x f(x)−f(a) x−a
In fact, the expression is the definition of a derivative: a function that represents the slope of a function f(x) at any particular point x. Derivatives are discussed at length later in this chapter. In this example, x + is the derivative of f(x) = x + x + .
For limits in which x → ∞, the algebraic trick that is most useful is multiplying the top and bottom of a fraction by the same thing. We want to divide all the terms of a function by the largest power of x in order to put x in the denominator of fractions. Then we can use the property that
$$\lim_{x \to \infty} \frac{1}{x^p} = 0, \quad p > 0.$$
That is, as x gets larger, the reciprocal of $x^p$ gets closer and closer to 0 as long as the power p is positive.
Example 3. Solve lim
x→∞
x + x + . x +
We want to divide all the terms by x in order to put x in the denominator of fractions: lim
x→∞
x + x + x +
= lim
x→∞
= lim
x→∞
=
( x + x + )( x ) ( x + )( x ) ( +
x
( +
+
x
x
)
( + limx→∞
x
)
+ limx→∞
( + limx→∞
x
)
x
)
.
Note that all of the limits can now be replaced with 0. Therefore, the solution breaks down to = .
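The same technique can be written out in full for a rational function whose coefficients are chosen here purely for illustration (they are hypothetical and are not the coefficients of Example 3). Dividing the numerator and the denominator by $x^2$, the largest power of x, turns every lower-order term into a constant over a power of x, and each of those terms has a limit of 0:

```latex
\begin{align*}
\lim_{x \to \infty} \frac{3x^2 + 2x + 1}{x^2 + 4}
  &= \lim_{x \to \infty}
     \frac{(3x^2 + 2x + 1)\left(\frac{1}{x^2}\right)}
          {(x^2 + 4)\left(\frac{1}{x^2}\right)} \\
  &= \lim_{x \to \infty}
     \frac{3 + \frac{2}{x} + \frac{1}{x^2}}{1 + \frac{4}{x^2}} \\
  &= \frac{3 + 2\lim_{x \to \infty}\frac{1}{x} + \lim_{x \to \infty}\frac{1}{x^2}}
          {1 + 4\lim_{x \to \infty}\frac{1}{x^2}} \\
  &= \frac{3 + 0 + 0}{1 + 0} = 3.
\end{align*}
```

When the numerator and denominator have the same highest power, as here, the limit is simply the ratio of the leading coefficients.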
4.4 The Number e
Consider the following special limit:
$$\lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x.$$
There is no clear-cut way to go about solving this limit. But let’s plug in bigger and bigger values of x and see what happens:
[Table: increasingly large values of x and the corresponding values of $\left(1 + \frac{1}{x}\right)^x$]
For very large values of x, it appears that $\left(1 + \frac{1}{x}\right)^x$ converges to the irrational number 2.718 . . . . This number is a very important number in mathematics. Like the number π, it is so important that it has its own name: e.
Definition: The number e is defined to be the limit, as x grows without bound, of the function
$$\lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^x = 2.718\ldots = e.$$
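The plug-in experiment is easy to repeat by machine. The sketch below evaluates (1 + 1/x)^x for a growing list of inputs and reports how far each value sits from Python's built-in constant `math.e`. The particular inputs (powers of 10) and the function name `compound` are choices made for this example.

```python
import math


def compound(x):
    """Evaluate (1 + 1/x)**x for a positive x."""
    return (1 + 1 / x) ** x


inputs = (1, 10, 100, 1_000, 10_000, 100_000, 1_000_000)
values = [compound(x) for x in inputs]

for x, value in zip(inputs, values):
    gap = math.e - value  # shrinks toward 0 as x grows
    print(f"x = {x:>9,d}   (1 + 1/x)^x = {value:.6f}   e - value = {gap:.2e}")
```

The values climb steadily but slowly: each tenfold increase in x buys roughly one more correct decimal place, which is why this limit is a definition of e rather than a practical way to compute it.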
The number e is named for the famous Swiss mathematician Leonhard Euler, one of the greatest mathematicians and scientists ever to have lived. Euler was the first mathematician to define a function. He also made incredible discoveries that form the basis of real analysis, number theory, engineering, physics, astronomy, and logic. Regarding e, Euler discovered other applications for the number that are both elegant and astonishing. Euler discovered that
$$e = \sum_{n=0}^{\infty} \frac{1}{n!} = \frac{1}{0!} + \frac{1}{1!} + \frac{1}{2!} + \cdots.$$
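The series can be checked numerically by adding up its first few terms. The following sketch builds each term 1/n! from the previous one instead of recomputing factorials; the function name `partial_sum` and the cutoffs shown are assumptions made for illustration.

```python
import math


def partial_sum(terms):
    """Sum of 1/n! for n = 0, 1, ..., terms - 1."""
    total = 0.0
    term = 1.0  # 1/0!
    for n in range(terms):
        total += term
        term /= n + 1  # turns 1/n! into 1/(n+1)!
    return total


for terms in (1, 2, 3, 5, 10, 15):
    s = partial_sum(terms)
    print(f"{terms:>2} terms: {s:.12f}   e - sum = {math.e - s:.2e}")
```

Because factorials grow so quickly, a dozen or so terms already agree with e to many decimal places, far faster than the limit of (1 + 1/x)^x approaches the same number.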
In addition, the normal distribution, a foundational tool in statistics that describes the natural state of many random processes, is an exponential function of base e:
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}.$$
In fact, the exponential function $f(x) = e^x$ has the astonishing property that the value of the function at a point x is always equal to the slope of the function at x. In Chapter 5, we’ll show that $e^x$ is the only nonzero function that is equal to its derivative.2 Furthermore, Euler proved an equation that other mathematicians have called “the most remarkable formula in mathematics.”3 This equation relates five fundamental mathematical constants: π, the ratio of the circumference to the diameter of any circle; e, Euler’s number; i, the imaginary number defined to be $\sqrt{-1}$; 1, the multiplicative identity; and 0, the additive identity. This equation is
$$e^{\pi i} + 1 = 0.$$
These are the five most important numbers in all of mathematics. π forms the basis of all trigonometry. i is important in differential equations to model dynamic processes, and engineering would be impossible without it. e is at the heart of all statistical theory. 0 and 1 form the bases of addition and multiplication in the real numbers, respectively. And Euler’s equation relates all of these constants together. In short, e was discovered by Euler, not invented. If such a discovery can have such sweeping results across all intellectual disciplines, then it is not difficult to imagine that the fabric of the universe is mathematical in nature.
4.5 Point Estimates and Comparative Statics
One of the ways social scientists can begin to explain a particular situation or phenomenon is to build a model that will make predictions about that situation or phenomenon. So for example, suppose we are trying to explain voter turnout. Our theories lead us to believe that voters are less inclined to vote when there are a lot of negative television advertisements being aired. We can build a simple model of voter turnout by using the number of negative televised campaign ads as a predictive variable. Clearly there are other factors that affect turnout, but let’s just keep the one independent variable for now to keep the example simple. I propose the following model:
$$f(x) = 50\left(1 - \frac{x}{c}\right),$$
where f(x) is the voter turnout, measured as a percentage; x is the number of negative campaign ads aired on certain stations in a certain place in a certain period of time; and c is the total number of commercials aired in that place during that time, treated as constant. Basically, this model proposes that turnout decreases linearly as the percentage of television commercials that are negative campaign ads increases. This model predicts 50% turnout when there are no negative campaign advertisements and 0% turnout when every commercial is a negative campaign ad.
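A model this small translates directly into a function. The sketch below assumes the model has the form f(x) = 50(1 − x/c), which is the linear rule consistent with the two predictions stated above (50% turnout with no negative ads, 0% when every commercial is negative). The function name, the argument names, and the example ad counts are hypothetical.

```python
def expected_turnout(negative_ads, total_ads):
    """Point estimate of turnout (in percent) under f(x) = 50 * (1 - x / c).

    negative_ads plays the role of x and total_ads the role of c.
    """
    if total_ads <= 0:
        raise ValueError("total_ads must be positive")
    if not 0 <= negative_ads <= total_ads:
        raise ValueError("negative_ads must lie between 0 and total_ads")
    return 50 * (1 - negative_ads / total_ads)


# Hypothetical markets, each airing 200 commercials in total.
for negative in (0, 16, 30, 100, 200):
    share = 100 * negative / 200
    estimate = expected_turnout(negative, 200)
    print(f"{share:5.1f}% negative ads -> predicted turnout {estimate:.1f}%")
```

Only the ratio x/c matters, so 16 negative ads out of 200 and 8 out of 100 produce the same point estimate.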
2 This property is also true of all multiples of ex . The function f(x) = ex , for example, also has a slope at point x that is equal to ex . But that does not diminish how extraordinary this property of e happens to be. 3 Richard Feynman, an American quantum physicist.
Building a model is easy. Judging the quality of the model is more difficult. But one obvious place to look is how well the model predicts turnout given the percentage of negative campaign advertisements. Suppose we have measured the data (reported in Table 4.1). Table 4.1
Voter Turnout and Percentage of Negative Campaign Ads
Observation Observed % Voter Expected Voter Number Negative Ads Turnout Turnout 1
.
2
.
.
3
.
.
4
.
5
.
.
Given the data, how shall we evaluate the model? One way to gauge the quality of the model is to consider the accuracy of the point estimates generated by the model. A point estimate is a prediction of a model when the independent variables are at certain values. For the model described above, when the number of negative ads in a market is x, the point estimate for turnout is f(x). So when negative advertisements make up 8% of the commercials on television, the point estimate of the model is 46% turnout. But in reality, the turnout is only 21.2%. In fact, none of the point estimates provided by the model in Table 4.1 are accurate. In social science modeling, even very complete models will rarely provide accurate point estimates of a continuous dependent variable such as voter turnout. There are better ways to judge the usefulness of a model. A comparative static describes the behavior of the dependent variable, f(x), as the independent variable, x, changes. So the model predicts that as the percentage of negative campaign ads increases, voter turnout will decrease. In Table 4.1 and Figure 4.4, the percentage of negative campaign ads increases between Observations 1 and 2 and between Observations 2 and 3, but turnout also increases. Between Observations 3, 4, and 5, the percentage of negative ads increases, and voter turnout does in fact decrease. So the model correctly predicts the comparative statics of the situation half of the time. There are better models out there to measure voter turnout. Effective models in political science should predict the correct comparative statics the vast majority of the time.
4.6 Definitions of the Derivative
One quick way to measure a comparative static is to measure the change in the dependent variable, f(x), as the independent variable moves from one point, $x_1$, to the next, $x_2$. This sort of measurement is called the average rate of change of a function from $x_1$ to $x_2$.
Figure 4.4 Observed and Expected Voter Turnout Versus Percentage of Negative Ads (horizontal axis: Percent of Ads That Are Negative; vertical axis: Voter Turnout; series: Observed, Expected)
Figure 4.4 plots the five data points from the expected model of voter turnout and the observed turnout. The average rate of change of observed voter turnout as the percentage of negative advertisements increases from 8 to 15 is
$$\frac{f(x_2) - f(x_1)}{x_2 - x_1} = \frac{28.7 - 21.2}{15 - 8} = \frac{7.5}{7} \approx 1.07.$$
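The calculation is just "rise over run" between two observations, which the short sketch below makes explicit. The first observation (8% negative ads, 21.2% turnout) is quoted in the text; the second turnout figure, 28.7%, is inferred here from the stated 7.5-point increase and should be read as an assumption. The function name is illustrative.

```python
def average_rate_of_change(x1, y1, x2, y2):
    """Slope of the secant line through (x1, y1) and (x2, y2)."""
    if x1 == x2:
        raise ValueError("the two points must have different x values")
    return (y2 - y1) / (x2 - x1)


# (percent negative ads, observed turnout); 28.7 is inferred, not quoted.
first = (8, 21.2)
second = (15, 28.7)

slope = average_rate_of_change(*first, *second)
print(f"Average rate of change: {slope:.2f} turnout points per point of negative ads")
```

The same function works for any pair of observations, so it can be applied to each consecutive pair in a data set to see where the observed comparative static agrees or disagrees with a model.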
So a 7-percentage-point increase in negative advertisements corresponds to a 7.5-point increase in turnout, which means that each additional percentage point of negative advertisements corresponds to an average increase in turnout of about 1.07 points. The average rate of change from one point to another is the slope of the secant line between the two points. The left panel of Figure 4.5 illustrates an example of a secant line. On any curve, the secant line between two points is the straight line that connects the two points, and the slope of the secant line is the average rate of change from the first point to the second. In fact, the classic formula for the slope of a linear function of the form y = mx + b,
$$m = \frac{f(x_2) - f(x_1)}{x_2 - x_1},$$
is the same as the formula for the average rate of change on any curve between two points $x_1$ and $x_2$.
Figure 4.5 An Example of a Secant Line and a Tangent Line for the Same Function (left panel: Secant Line; right panel: Tangent Line)
On a linear graph, the slope is constant everywhere. On a nonlinear graph, the slope is constantly changing from point to point. We are interested in the instantaneous rate of change at a point on a nonlinear graph. Suppose we can map some social science model as a smooth, nonlinear curve. Then the instantaneous rate of change at a point tells us how the dependent variable is changing at that point. Such information allows researchers to study the dynamics of a social process.
Finding the average rate of change is easy. How can we find an instantaneous rate of change? Imagine again that we are drawing a secant line between two points, as in the left panel of Figure 4.5. Now imagine that we push those two values of x together until they are so close they might as well be the same point. Then the line resembles the one in the right panel of Figure 4.5. When a secant line hits the curve at only one point with the same direction as the curve, then it is called a tangent line at that point on the curve. The slope of the tangent line is the instantaneous rate of change of the function at that point.
We can approximate the slope of the tangent line by finding the slopes of secant lines in which the two points are pushed closer and closer together. Let one of the two points be called a, and let the other be x. Then we can push the two points infinitely close together by taking a limit. This limit is the instantaneous rate of change of a function at the point a. We call it a derivative.
Definition: Derivative (Version 1)
A continuous function f(x) is differentiable at the point x = a if and only if the limit
$$\lim_{x \to a} \frac{f(x) - f(a)}{x - a}$$
exists. If the function is differentiable, then the value of this limit is the derivative of the function at the point x = a.
There is another way to express the definition of a derivative. Suppose the two points are exactly distance h away from each other on the x-axis. Then we can push the two points together by making h smaller and smaller until it eventually reaches 0.
Definition: Derivative (Version 2)
A continuous function f(x) is differentiable if and only if the limit
$$\lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
exists. If the function is differentiable, then the resultant function that is the solution of this limit is the derivative of the function.
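Version 2 of the definition lends itself to a numerical experiment: compute the slope of the secant line for a shrinking gap h and watch the values settle. The sketch below uses a hypothetical quadratic, f(x) = x² + 3x − 4, chosen only for illustration. For this function the difference quotient simplifies algebraically to 2x + 3 + h, so at x = 5 the secant slopes should close in on 13.

```python
def f(x):
    """Hypothetical example function, not one taken from the text."""
    return x**2 + 3 * x - 4


def difference_quotient(func, x, h):
    """Slope of the secant line through (x, func(x)) and (x + h, func(x + h))."""
    if h == 0:
        raise ValueError("h must be nonzero; the derivative is the limit as h -> 0")
    return (func(x + h) - func(x)) / h


x0 = 5.0
steps = (1.0, 0.1, 0.01, 0.001, 0.0001)
slopes = [difference_quotient(f, x0, h) for h in steps]

for h, slope in zip(steps, slopes):
    print(f"h = {h:<7} secant slope = {slope:.5f}")
```

Each secant slope overshoots the limiting value by roughly h, which is exactly the leftover h term that vanishes when the limit is taken by hand.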
Note that these definitions can be used to find the instantaneous rate of change at a specific point. If we set a = 5 in Version 1 of the definition or x = 5 in Version 2, then the definitions give us the instantaneous rate of change precisely at 5. But if we keep a and x general, then we can derive general formulas for the instantaneous rate of change at any point. Simply plug a number into this general formula to find the instantaneous rate of change at this point.
Example 1. Find the derivative of f(x) = x + x − using the first version of the definition of a derivative.
Whenever we use this first definition, we are stuck with (x − a) in the denominator, which approaches 0 as x → a. Canceling out this term is always the trick. It will be helpful to recall from Section 1.7 that polynomials of the form $x^2 - a^2$ factor to (x + a)(x − a). The derivative is lim
x→a
f(x) − f(a) x + x− −( a + a− ) = lim x→a x−a x−a x − a + x− a− + = lim x→a x−a (x − a ) + (x − a) = lim x→a x−a (x + a)(x − a) + (x − a) = lim . x→a x−a
The (x − a) term appears in every term in the numerator and denominator, so we can cancel it out. Therefore the derivative is lim (x + a) +
x→a
= (a + a) +
= a+ .
So for any point a, the slope of the curve x + x + at the point x = a is a + .
Example 2. Find the derivative of f(x) = x + x − using the second version of the definition of a derivative. The second version of the definition has the h term in the denominator, which must be canceled out. lim
h→
f(x + h) − f(x) (x + h) + (x + h) − − ( x + x − ) = lim h→ h h (x + xh + h ) + (x + h) − − x − x + = lim h→ h
x − x + xh + h + x − x + h − + h xh + h + h = lim h→ h = lim x + h + = lim h→
h→
= x+ . Either version of the definition of the derivative gives us the same result.
Both versions of the definition of a derivative also define when a function is differentiable. The derivative of a function exists when the function is continuous and the limit definition of the derivative exists. That means that for all points a in the domain of a function f,
$$\lim_{x \to a^+} f(x) = \lim_{x \to a^-} f(x) = f(a)$$
and
$$\lim_{x \to a^+} \frac{f(x) - f(a)}{x - a} = \lim_{x \to a^-} \frac{f(x) - f(a)}{x - a}.$$
In other words, f(x) must be both continuous and smooth for the derivative to exist. Functions can be continuous without being differentiable if there is a jagged point somewhere on the graph where the slope appears to be different just to the left and just to the right of that point. Such a function is illustrated in Figure 4.6.
Figure 4.6 A Function That Is Continuous Everywhere but Is Not Differentiable at 3
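A jagged point can be detected numerically by comparing the slope just to the right of a point with the slope just to the left. The sketch below uses g(x) = −|x − 3| as a stand-in for a function with a corner at x = 3. This choice is hypothetical and is not claimed to be the function drawn in Figure 4.6.

```python
def g(x):
    """Hypothetical function with a corner at x = 3: continuous, but not smooth there."""
    return -abs(x - 3)


def one_sided_slope(func, a, h):
    """Difference quotient at a; h > 0 looks to the right, h < 0 to the left."""
    return (func(a + h) - func(a)) / h


right_slope = one_sided_slope(g, 3.0, 1e-6)
left_slope = one_sided_slope(g, 3.0, -1e-6)

print(f"slope just right of 3: {right_slope:.4f}")
print(f"slope just left of 3:  {left_slope:.4f}")
print("one-sided slopes agree:", abs(right_slope - left_slope) < 1e-3)
```

The function values on the two sides of 3 meet, so the continuity condition holds, but the two one-sided slopes have opposite signs, so the second condition fails and no derivative exists at that point.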
4.7 Notation
There are two ways to denote a derivative. They are both commonly and interchangeably used, and it will help to be familiar with both ways. There are two notations because calculus was discovered independently by two geniuses at about the same time. One of them was Gottfried Wilhelm Leibniz, a 17th-century German mathematician and philosopher. Leibniz’s contributions have influenced everything from modern philosophy to modern computer science. The other genius who invented calculus was Isaac Newton. You may have heard of him. Newton was well-known as the greatest English scientist. Leibniz was periodically employed by royal families in various German principalities as a chief scientific adviser. The discoveries made by Newton were claimed as English inventions, and the German royalty claimed Leibniz’s discoveries. Therefore, when both invented calculus, the credit to be claimed became a matter of international rivalry. Today we use both Newton’s calculus notation and Leibniz’s notation, as something of a time-tested compromise.
Figure 4.7 Newton and Leibniz (XKCD comics by Randall Munroe, #992)
Source: xkcd.com/626
Consider a function f(x) and its derivative. Newton would denote the derivative as $f'(x)$, which is pronounced “f-prime.” The second derivative is the derivative of the derivative of a function. Newton denotes the second derivative as $f''(x)$, or “f-double prime,” and the third derivative (the derivative of the second derivative) as $f'''(x)$, or “f-triple prime.” For derivatives of a higher order than 3, we write the actual number of the order of differentiation rather than writing more prime signs. So the nth derivative of f(x) is $f^{(n)}(x)$ according to Newtonian notation.
Leibniz’s notation is a bit more complicated but also a bit more logical. Remember that a derivative is a rate of change: the change in the dependent variable over the change in the independent variable. In physics, the symbol Δ is used to express the difference between two quantities. We can rewrite the “rise over run” formula for a linear slope:
$$\frac{f(x_2) - f(x_1)}{x_2 - x_1} = \frac{\Delta f(x)}{\Delta x}.$$
Leibniz noted that as the points $x_1$ and $x_2$ get closer together, the notation Δ stops making sense because the distance between the points is closer and closer to 0. Instead, Leibniz uses the notation d, which represents an infinitesimal, or infinitely small, difference. Using Leibniz’s notation, the first derivative of f(x) is
$$\frac{d}{dx}\Big(f(x)\Big) \quad \text{or} \quad \frac{df(x)}{dx}.$$
The second derivative is
$$\frac{d^2}{dx^2}\Big(f(x)\Big) \quad \text{or} \quad \frac{d^2 f(x)}{dx^2},$$
and, in general, the nth derivative is
$$\frac{d^n}{dx^n}\Big(f(x)\Big) \quad \text{or} \quad \frac{d^n f(x)}{dx^n}.$$
The notations really are describing the same thing, so
$$\frac{df(x)}{dx} = f'(x).$$
All the notations are summarized in Table 4.2.
Table 4.2 Newton’s and Leibniz’s Derivative Notations
Derivative        Newton’s Notation    Leibniz’s Notation
1st Derivative    $f'(x)$              $\frac{df(x)}{dx}$
2nd Derivative    $f''(x)$             $\frac{d^2 f(x)}{dx^2}$
3rd Derivative    $f'''(x)$            $\frac{d^3 f(x)}{dx^3}$
nth Derivative    $f^{(n)}(x)$         $\frac{d^n f(x)}{dx^n}$
4.8 Shortcuts for Finding Derivatives
The limit definitions of a derivative are correct and logical, and they are perfectly valid ways to find the derivative of a function. But they are also somewhat cumbersome and are sometimes very difficult to use. The limit definitions can be used to prove a number of useful theorems and to provide shortcuts for taking derivatives. The proofs are omitted here, but I present the shortcuts. In practice, derivatives are almost always calculated using these rules and shortcuts rather than the limit definitions. In the following theorems, let f and g be differentiable functions, and let k be a constant.
1. If f(x) = k, then $f'(x) = 0$. The derivative of a constant is 0. The graph of a constant function is a horizontal line, which has a slope of 0 at every point.
Example 1.
d ( dx
)= ,
d ( dx
√
π) = ,
d (e) dx
= .
2. The derivative of the independent variable by itself is 1.

$$\frac{d}{dx}\,x = 1.$$

Recall that the independent variable is whatever variable is listed in the denominator of the derivative notation. Example 2.

$$\frac{d}{dy}(y) = 1, \qquad \frac{d}{dz}(z) = 1, \qquad \frac{d}{d\eta}(\eta) = 1.$$
3. Power rule: If the base is a variable and the exponent is a nonzero constant, then bring the exponent out in front of x as a factor, and subtract 1 from the exponent:

$$\frac{d}{dx}\left(x^k\right) = kx^{k-1}$$

Example 3.
d (x dx
)=
x,
d (z− dz
for any $k \neq 0$.
) = − z− ,
d (α . dα
)= .
4. Derivatives can be broken up over addition (or subtraction):

$$\frac{d}{dx}\Big(f(x) + g(x)\Big) = f'(x) + g'(x).$$

Example 4.
d (z dz
+ )=
d (z dz
)+
d ( dz
) = z.
5. Constant factors can be brought outside a derivative:

$$\frac{d}{dx}\Big(kf(x)\Big) = kf'(x).$$
α.
.
Example 5.
d ( dz
z +
)=
d (z dz
+ )=
× z=
z.
6. Product rule: For the product of two functions, the derivative is the derivative of the first function times the second function plus the derivative of the second function times the first function:

$$\frac{d}{dx}\Big(f(x)g(x)\Big) = f'(x)g(x) + f(x)g'(x).$$

Example 6. Find the derivative of g(y) = ( y + y + )( y −
).
We can multiply out these polynomials, but that is messy and unnecessary. Instead, we can use the product rule. The derivative of the first part is d ( y + y+ )= y+ . dy The derivative of the second part is d ( y− dy
)= .
Therefore, the derivative is g′ (y) = ( y + y + ) + ( y + )( y −
).
7. Quotient rule: For the quotient of two functions, the derivative is the derivative of the numerator times the denominator minus the derivative of the denominator times the numerator, all over the denominator squared:

$$\frac{d}{dx}\left(\frac{f(x)}{g(x)}\right) = \frac{f'(x)g(x) - f(x)g'(x)}{g(x)^2}.$$

That’s a tough formula to remember. One way to remember it is with the mnemonic device “low dee-high minus high dee-low over low-squared.” Here, “high” refers to the numerator, “low” refers to the denominator, and “dee” refers to a derivative. So “low dee-high” means the denominator times the derivative of the numerator, and “high dee-low” is the numerator times the derivative of the denominator. Example 7. Find the derivative of h(y) =
y + y+ . y−
We can use the quotient rule here. The derivatives of the top and bottom parts were found in the previous example. Plugging into the formula for the quotient rule, we find that the derivative is h′ (y) =
( y + )( y − ) − ( y + y + ) . ( y− )
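The product and quotient rules can be checked numerically. The sketch below assumes two hypothetical functions, f(x) = x² + 1 and g(x) = 3x − 2 (chosen for illustration, not taken from the examples above), and compares each rule against a central-difference estimate of the slope.

```python
def numeric_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Hypothetical functions and their hand-computed derivatives.
def f(x):
    return x ** 2 + 1


def f_prime(x):
    return 2 * x


def g(x):
    return 3 * x - 2


def g_prime(x):
    return 3.0


def product_rule(x):
    # f'(x) g(x) + f(x) g'(x)
    return f_prime(x) * g(x) + f(x) * g_prime(x)


def quotient_rule(x):
    # "low dee-high minus high dee-low over low-squared"
    return (f_prime(x) * g(x) - f(x) * g_prime(x)) / g(x) ** 2


if __name__ == "__main__":
    x = 2.0
    print(product_rule(x), numeric_derivative(lambda t: f(t) * g(t), x))
    print(quotient_rule(x), numeric_derivative(lambda t: f(t) / g(t), x))
```

At x = 2 the product rule gives 4·4 + 5·3 = 31 and the quotient rule gives (16 − 15)/16 = 0.0625, and the numerical estimates agree to several decimal places.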
There are also some rules about the derivatives of specific but common functions that are helpful to remember.

• Square roots: $\dfrac{d}{dx}\left(\sqrt{x}\right) = \dfrac{1}{2\sqrt{x}}$

• Reciprocal functions: $\dfrac{d}{dx}\left(\dfrac{1}{x}\right) = \dfrac{-1}{x^2}$

• Reciprocal power functions: $\dfrac{d}{dx}\left(\dfrac{1}{x^n}\right) = \dfrac{-n}{x^{n+1}}$

• Exponential functions with base e: $\dfrac{d}{dx}\left(e^x\right) = e^x$

• Exponential functions with any other base a: $\dfrac{d}{dx}\left(a^x\right) = \ln(a)\,a^x$

• Natural logarithms: $\dfrac{d}{dx}\Big(\ln(x)\Big) = \dfrac{1}{x}$
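As a sanity check on the list above, the following sketch compares each rule with a numerical slope at a single point. The evaluation point x = 2, the base a = 3, and the power n = 3 are arbitrary illustrative choices.

```python
import math


def numeric_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Each entry pairs a function with the derivative claimed by the rule.
RULES = {
    "square root": (math.sqrt, lambda x: 1 / (2 * math.sqrt(x))),
    "reciprocal": (lambda x: 1 / x, lambda x: -1 / x ** 2),
    "reciprocal power (n = 3)": (lambda x: 1 / x ** 3, lambda x: -3 / x ** 4),
    "exponential, base e": (math.exp, math.exp),
    "exponential, base a = 3": (lambda x: 3 ** x, lambda x: math.log(3) * 3 ** x),
    "natural logarithm": (math.log, lambda x: 1 / x),
}


def max_rule_error(x):
    """Largest gap between a rule's formula and the numerical slope at x."""
    return max(
        abs(rule(x) - numeric_derivative(func, x))
        for func, rule in RULES.values()
    )


if __name__ == "__main__":
    print(max_rule_error(2.0))
```

A tiny maximum error suggests the formulas and the numerical slopes agree; it is a spot check at one point, not a proof.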
Example 8. What is the derivative of f(x) = x +
x
+
√ x + ex ?
We can break up the derivative over addition: ( ) d d d √ d f ′ (x) = ( x ) + + ( x) + (ex ). dx dx x dx dx Then we can handle each individual part of the function. By the power rule, d ( x )= dx By other rules, we know that
x.
( ) d − = , dx x x d √ ( x) = √ , dx x d x x (e ) = e . dx
So the derivative is f ′ (x) =
x +
− + √ + ex . x x
4.9 The Chain Rule

There is one more important shortcut to remember when taking derivatives. The chain rule deals with differentiating the compositions of functions, which were first discussed in Section 2.4. Recall that a function composition is one function stuck inside another.

Definition: Function composition. If f and g are functions, the composition of f with g is

$$(f \circ g)(x) = f(g(x)).$$
The chain rule takes the derivative of compositions. It is the most useful and powerful rule with regard to differentiation. After mastering the chain rule, a calculus student should be able to differentiate just about any function, including some very big and complicated functions.

Theorem: The chain rule. If f and g are differentiable functions, then

$$\frac{d}{dx}\Big((f \circ g)(x)\Big) = \frac{d}{dx}\Big(f(g(x))\Big) = f'(g(x))\,g'(x).$$
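The theorem can be illustrated with a short numerical check. This sketch assumes a hypothetical outer function f(u) = u³ and inner function g(x) = x² − 1, picked only for illustration, and compares f′(g(x))g′(x) with a numerical slope of the composition.

```python
def numeric_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


# Hypothetical outer and inner functions with hand-computed derivatives.
def f(u):
    return u ** 3


def f_prime(u):
    return 3 * u ** 2


def g(x):
    return x ** 2 - 1


def g_prime(x):
    return 2 * x


def composition(x):
    # (f o g)(x) = f(g(x))
    return f(g(x))


def chain_rule(x):
    # Derivative of the outer function evaluated at g(x), times g'(x).
    return f_prime(g(x)) * g_prime(x)


if __name__ == "__main__":
    print(chain_rule(2.0), numeric_derivative(composition, 2.0))
```

At x = 2 the inner function equals 3, so the chain rule gives 3·3²·4 = 108, which matches the numerical slope of (x² − 1)³.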
The proof of the chain rule depends on the conceptual idea of derivatives as ratios of very small differences. Rewriting the chain rule in Leibniz’s notation gives us

$$\frac{df}{dx} = \frac{df}{dg}\,\frac{dg}{dx}.$$

Leibniz conceived of derivatives as ratios of very small differences, where these differences could still be treated as regular numbers. The difference dg cancels in the top and bottom of the product in the above equation, leaving us with df/dx. The first factor, df/dg, is the derivative of the function f treating the function g as the independent variable. The second factor is the derivative of g with respect to x. The method is the same: Take the derivative of f treating g like x, then multiply it by the derivative of g.

Example 1. Find the derivative of h(x) = (x − ) . Let f(x) = x and g(x) = x − . Then h(x) is equal to the function composition $(f \circ g)(x)$. By the chain rule, the derivative of h(x) is h′ (x) = f ′ (g(x))g′ (x). The derivative of f(x) is
f ′ (x) = x ,
so f ′ (g(x)) is
f ′ (g(x)) = f ′ (x − ) = (x − ) .
The derivative of g(x) is
g′ (x) = x .
So plugging into the chain rule formula, we obtain h′ (x) = f ′ (g(x))g′ (x) = (x − ) x =
x (x − ) .
The most challenging and important part of applying the chain rule is the identification of the layers that can be expressed with function compositions. One method that can clear up the process is the use of capital letters instead of functions. For example, consider the following function: √ x + )
w(x) = e( /
.
Let’s use the letter A to represent the whole exponent, B to represent the denominator of the fraction, and C to represent the terms underneath the square root. Altogether, we can rewrite this function as f(x) = eA , A=
, B √ B = C, C=x + . If we plug C into B, then plug B into A, then plug A into the outermost layer, we would recover the original function. These capital letters stand in for functions, but it is simpler and more intuitive to write A, B, and C than functions g(x), h(x), and j(x). Next, the chain rule tells us that df df dA dB dC = × × × , dx dA dB dC dx so we take the derivative of each step with respect to the next capital letter, df = eA , dA
( ) dA d − = = , dB dB B B dB d (√ ) = C = √ , dC dC C ( ) dC d = x + = x, dx dx and then multiply these derivatives together: f ′ (x) = eA ×
− −xeA × √ × x= √ . B C B C
The last step is to substitute back in for A, B, and C so that the derivative is expressed simply in terms of x. This step is easiest when we substitute the outermost layers first and work our way inward. First we substitute for A: f ′ (x) =
−xe B √ . B C
Then we substitute for B: √
√ C
−xe C −xe f (x) = √ √ = C/ C C ′
.
And finally we substitute for C: √
−xe f ′ (x) = (x + )
x +
/
.
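The capital-letter procedure translates directly into code: compute each layer, differentiate each layer with respect to the next one, and multiply. The sketch below assumes the illustrative function $e^{1/\sqrt{x^2+1}}$; the exponent 2 and the constant 1 inside the square root are assumptions made for this example.

```python
import math


def layered_derivative(x):
    """Differentiate e^(1/sqrt(x^2 + 1)) by multiplying layer derivatives."""
    # Build the layers from the inside out.
    C = x ** 2 + 1          # innermost layer (assumed form)
    B = math.sqrt(C)        # B = sqrt(C)
    A = 1 / B               # A = 1/B

    # Derivative of each layer with respect to the next layer in.
    df_dA = math.exp(A)             # d/dA of e^A
    dA_dB = -1 / B ** 2             # d/dB of 1/B
    dB_dC = 1 / (2 * math.sqrt(C))  # d/dC of sqrt(C)
    dC_dx = 2 * x                   # d/dx of x^2 + 1

    return df_dA * dA_dB * dB_dC * dC_dx


def closed_form(x):
    # The same derivative after substituting back for A, B, and C.
    return -x * math.exp(1 / math.sqrt(x ** 2 + 1)) / (x ** 2 + 1) ** 1.5


if __name__ == "__main__":
    print(layered_derivative(1.5), closed_form(1.5))
```

Both functions return the same value, which mirrors the substitution steps: the product of layer derivatives and the simplified expression in x are one and the same derivative.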
We’ve now taken the derivative of this complicated function. Recall from Section 4.8 that derivatives can be broken up over addition and subtraction. If a function is the sum of two functions that individually require the chain rule, the best approach is to apply the chain rule to each term separately. Example 2. Find the derivative of ( h(x) = First, consider the first term
( ln
(
)) x + e− x −x+
( ln
)) x . x −x+
We can rewrite this function in terms of its layers as follows: (
( ln
)) x =A , x −x+ A = ln(B), B=
The derivative of the first layer is
x . x −x+
A,
the derivative of the second layer is , B and the derivative of the third layer requires the quotient rule, x(x − x + ) − x ( x − ) . (x − x + )
x +
.
So, multiplying these layers together, we get A B
x(x − x + ) − x ( x − ) . (x − x + )
Substituting for A, this becomes ( ) ln(B) B
x(x − x + ) − x ( x − ) , (x − x + )
and substituting for B this becomes ( ( )) x(x − x + ) − x ( x − ) x x −x+ ln × × . x −x+ x (x − x + ) Notice that we can cancel out a factor of x from the denominator of the second term in this product and the numerator of the third term, and a factor of (x − x + ) from the numerator of the second term and the denominator of the third term. So this expression becomes ( ( )) (x − x + ) − x( x − ) x × × ln x −x+ x x −x+ ( ( ) ) (x − x + ) − x( x − ) x = ln . x −x+ x(x − x + ) Next, we consider the second term in the original function: e− e−
x +
x +
. We rewrite this function as
= eA ,
A=− x + . We take the derivative of each part, eA and − x, multiply them together, and substitute for A: − xeA = − xe−
x +
.
Finally, we add these two derivatives together to obtain the complete derivative of h(x): ′
h (x) =
(
)) x ln x −x+ (
(x − x + ) − x( x − ) − xe− x(x − x + )
x +
.
Some functions will require the use of the quotient rule or the product rule (see Section 4.8) for an intermediate layer. These functions require a slightly different strategy. For example, consider the derivative of √ x − x f(x) = . ln(x + ) We always start applying the chain rule by breaking the function into layers and labeling each layer with a capital letter. In this case, we can write the function as √ f(x) = A,
x − x , B B = ln(C), C=x + . A=
Note again that we can plug C into B, then B into A, then A into f(x) to obtain the original function. Unlike the previous examples, however, the second layer A contains both x and B and requires the quotient rule. When this happens, we do as follows: 1. We write the derivative of this layer out using the quotient rule, in this example (or the product rule in other situations). 2. We write the derivative of B simply as B′ without trying to simplify further. 3. We apply the chain rule to the remaining layers to calculate B′ . 4. Then we plug this calculation of B′ into the original calculation. Using these steps, it is only necessary to multiply all the layers up to and including the one that uses the product/quotient rule. The later layers are not multiplied; they are only used to calculate the derivative to plug into the product/quotient rule. One of my students described this approach as a “puzzle within a puzzle,” since the problem requires applying the chain rule to a function while applying the chain rule to a more complicated function. Returning to the example, the derivative of each layer is df = √ , dA A dA B( x − ) − ( x − x)B′ = , dB B dB = , dC C dC = x. dx Note that we left B′ in the second layer’s derivative. Next, we multiply all the layers up to and including the one that uses the quotient rule: f ′ (x) = √ × A =
B( x − ) − ( x − x)B′ , B
B( x − ) − ( x − x)B′ √ . B A
We determine B′ by multiplying the subsequent layers together, x B′ = × x = , C C and we substitute this expression into the overall derivative: f ′ (x) =
B( x − ) − ( x − x) Cx √ . B A
Now we proceed as before by substituting for A, f ′ (x) =
B( x − ) − ( x − x) Cx √ , x − x B B
then substituting for B, f ′ (x) =
ln(C)( x − ) − ( x − x) Cx √ , x − x ln(C) ln(C)
and finally substituting for C, f ′ (x) =
ln(x + )( x − ) − ( x − x) x √ x − x ln(x + ) ln(x + )
This function, although rather ugly, is the correct derivative. Example 3. Find the derivative of w(y) =
ey
−
. ln( y + )
We start by writing the function in terms of layers: w(y) =
, A A = BC,
B = eD , C = ln(E), D= y − , E= y + . The derivatives of each layer are dw dA dA dy
=
− , A
=
B′ C + BC′ ,
=
eD ,
dB dD dC dE dD dy
= =
y,
dE dy
=
y.
E
,
x +
.
The second layer involves the product rule. We multiply all the derivatives of the layers up to and including the second layer: dw dA − ′ dw −B′ C − BC′ = = (B C + BC′ ) = . dy dA dy A A Separately, we use the remaining layers to determine B′ , dB dB dD = = yeD , dy dD dy and C′ , dC dE y dC = = . dy dE dy E We plug B′ and C′ into the overall derivative, ( dw = dy
)
−( yeD )C − B
y E
.
A
Then we substitute for A,
( dw = dy
)
−( yeD )C − B
y E
,
BC
then for B,
( dw = dy
)
−( yeD )C − eD
y E
,
e DC
then for C,
)
( dw = dy
−( yeD ) ln(E) − eD e
D
y E
,
ln(E)
then for D, dw = dy
−( ye
y −
e
) ln(E) − e ( y − )
y −
(
,
ln(E)
and finally for E, dw = dy
−( ye
y −
) ln( y + ) − e e
( y − )
) y E
y −
ln( y + )
(
) y y +
.
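The “puzzle within a puzzle” strategy can also be sketched in code. The example assumes a hypothetical function of the same shape as the quotient-rule example in this section, $\sqrt{(x^3 - x)/\ln(x^2 + 1)}$, with exponents and constants chosen purely for illustration. The inner derivative B′ is computed separately and then plugged into the quotient-rule layer.

```python
import math


def f(x):
    # Hypothetical function; defined where the ratio under the root is positive.
    return math.sqrt((x ** 3 - x) / math.log(x ** 2 + 1))


def derivative_by_layers(x):
    # Layers: f = sqrt(A), A = (x^3 - x) / B, B = ln(C), C = x^2 + 1.
    C = x ** 2 + 1
    B = math.log(C)
    A = (x ** 3 - x) / B

    # Inner puzzle: B' comes from the remaining layers, dB/dC times dC/dx.
    B_prime = (1 / C) * (2 * x)

    # Quotient-rule layer, using B' where the derivative of B is needed.
    dA_dx = (B * (3 * x ** 2 - 1) - (x ** 3 - x) * B_prime) / B ** 2

    # Outer layer: derivative of sqrt(A) with respect to A.
    df_dA = 1 / (2 * math.sqrt(A))

    # Only the layers up to and including the quotient-rule layer are multiplied.
    return df_dA * dA_dx


def numeric_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


if __name__ == "__main__":
    print(derivative_by_layers(2.0), numeric_derivative(f, 2.0))
```

Note that the layers for B and C never appear as separate factors in the final product; they are used only to build B′.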
When using the chain rule, most people do not think about function compositions, at least not directly. With enough practice, you will develop an intuition to examine a function and identify its different layers, differentiate each layer holding all inner layers constant, and multiply all of these layers together to obtain the full derivative without using the capital
letters. The procedure described above is useful for those who do not yet have this intuition, but at some point, once you feel comfortable doing so, try applying the chain rule without the capital letters. You will be able to perform these calculations much more quickly. In the following example, I take the derivative of the normal distribution without using the capital-letter technique.

Example 4. Take the derivative of the normal distribution:

$$f(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{\frac{-(x-\mu)^2}{2\sigma^2}}.$$

This problem looks intimidating, but it is easily solved by observing that x is the only variable in the function (the Greek letters are all treated as constants) and by using the chain rule on the layers of the function. We want to find

$$f'(x|\mu,\sigma^2) = \frac{d}{dx}\left(\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{\frac{-(x-\mu)^2}{2\sigma^2}}\right).$$

Since $1/\sqrt{2\pi\sigma^2}$ is a constant, we can bring this term outside the derivative:

$$\frac{1}{\sqrt{2\pi\sigma^2}}\,\frac{d}{dx}\, e^{\frac{-(x-\mu)^2}{2\sigma^2}}.$$

Notice that the term inside the derivative now has two layers. The outer layer is $e^x$, which is equal to its own derivative, and the inner layer is $-(x-\mu)^2/2\sigma^2$, so by the chain rule the derivative becomes

$$\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{\frac{-(x-\mu)^2}{2\sigma^2}}\,\frac{d}{dx}\left(\frac{-(x-\mu)^2}{2\sigma^2}\right).$$

Finally, by the power rule, the derivative of $-(x-\mu)^2/2\sigma^2$ is simply

$$\frac{-(x-\mu)}{\sigma^2}.$$

So the derivative of the normal distribution is

$$f'(x|\mu,\sigma^2) = \frac{-(x-\mu)}{\sigma^2\sqrt{2\pi\sigma^2}}\, e^{\frac{-(x-\mu)^2}{2\sigma^2}}.$$
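A short numerical sketch can confirm the result. It assumes the usual normal density with mean μ and variance σ², and uses arbitrary illustrative values μ = 0.5 and σ² = 2; the function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Normal density with mean mu and variance sigma2."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def normal_pdf_derivative(x, mu, sigma2):
    # Chain rule result: the density times the derivative of the exponent.
    return -(x - mu) / sigma2 * normal_pdf(x, mu, sigma2)


def numeric_derivative(func, x, h=1e-6):
    """Central-difference estimate of the slope of func at x."""
    return (func(x + h) - func(x - h)) / (2 * h)


if __name__ == "__main__":
    mu, sigma2 = 0.5, 2.0
    density = lambda x: normal_pdf(x, mu, sigma2)
    print(normal_pdf_derivative(1.3, mu, sigma2), numeric_derivative(density, 1.3))
    # The slope is zero at the mean, where the bell curve peaks.
    print(normal_pdf_derivative(mu, mu, sigma2))
```

The slope is positive to the left of μ, zero at μ, and negative to the right, which matches the shape of the bell curve.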
Exercises

1. Imagine a function f(n) in which the input n is a positive integer and the output is a regular polygon with n sides. A regular polygon is a shape in which every side has the same length and every angle is equal. For example, if n = 3, the function outputs an equilateral triangle, and if n = 6, the function outputs a hexagon. What is $\lim_{n\to\infty} f(n)$?
2. Solve the following limits: (a) lim x − x + x→
(b) lim
y→∞
(c) lim z→
y
z
x+ x y + y − y − y − y+ (e) lim y→∞ y +y − z − z+ (f) lim z→ z− (d) lim
x→∞
(g) lim
x→ +
x−
(h) lim
y− ) ( (i) lim + z→∞ z y→
z
3. In July 2013, Barack Obama gave a speech in which he claimed, “Our deficits are falling at the fastest rate in 60 years.”⁴
(a) Let t represent a year, and let f(t) be a function that returns the level of the national debt in year t. Recall that the deficit is the yearly change in the national debt. Explain how the statement “Our deficits are falling at the fastest rate in 60 years” can be understood in terms of derivatives of the national debt taken with respect to time.
(b) Draw a graph with the total national debt on the y-axis and time (in years) on the x-axis, to illustrate the fact that the debt is growing but deficits are shrinking. (Don’t look up specific numbers. Just think about the shape of the curve that reflects this situation.)

4. Draw the first derivative of the function that is graphed below. (Hint: Pay close attention to when the slope of the graph is positive, negative, and 0.)
4 Greenberg, J. (2013, July 25). Obama says deficit is falling at the fastest rate in 60 years. Tampa Bay Times. Retrieved from http://www.politifact.com/truth-o-meter/statements/2013/jul/25/barack-obama/obama-says-deficitfalling-fastest-rate-60-years/.
[Figure for Exercise 4: graph of f(x) against x]
5. Find the derivatives of the following functions: y (a) f(x) = x + x − x + y √ (b) g(y) = ey − y 1
+ z z (d) j(x) = (x + ) ( x − x − ) √ (e) k(y) = e y ln(z) (f) l(z) = z (c) h(z) = ln(z) + 1
(g) m(x) =
+ e−x √ √y (h) n(y) = ye (i) p(z) =
ez + ln(z)
(j) q(x) = ln(x + x) (k) r(y) = e /(y
+ y− )
(l) s(z) = ln(z + z)e( /z + z− ) √ x + (m) t(x) = √ x (y − y ) ln( y − ) (n) v(y) = ey − y
1
1
6. Find a formula for the nth derivative (where n is 3 or higher) of the following functions: (Hint: Take the first derivative, the second derivative, the third derivative, and so on,
until you are confident that you see a pattern.)
(a) $f(x) = x\ln(x)$
(b) $g(x) = xe^x$

7. Use either limit definition of a derivative to find the first derivatives of the following functions: (a) f(x) = x (b) g(y) = y + y +

8. One very cool application of derivatives is the Taylor series approximation of functions. We approximate a function with a polynomial with large enough exponents. For a function that can be differentiated as many times as needed (and is sufficiently well behaved), we have the following theorem:

$$f(x) = \sum_{n=0}^{\infty} \frac{f^{(n)}(a)}{n!}\,(x-a)^n.$$

Here, a is any generic value. We say that the Taylor series is centered on a. The summation sign $\sum_{n=0}^{\infty}$ adds up iterations of the expression inside the sum for increasing values of n, starting from 0. The Taylor series is exactly equal to the function when we add up infinitely many terms, but for a finite number of terms, we have an approximation:

$$f(x) \approx \sum_{n=0}^{N} \frac{f^{(n)}(a)}{n!}\,(x-a)^n.$$

(a) For this exercise, approximate the standard normal probability distribution,

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2},$$

with a second-degree Taylor polynomial, centered on 0. In other words, find $f(0)$, $f'(0)$, and $f''(0)$, and plug them into

$$f(x) \approx f(0) + f'(0)\,x + \frac{f''(0)}{2}\,x^2.$$

Your answer will be a quadratic function, so leave the xs alone. You are trying to find the coefficients for this quadratic approximation. (Hint: Find the derivatives first, and then plug in 0.)
(b) Graph the standard normal distribution and the Taylor approximation you derived in (a) on the same graph. When does the Taylor polynomial do a good job approximating the function, and when does it do a poor job?

9. Suppose we are trying to create a mathematical model of a variable that has only two possible outcomes. Let the value of this variable for observation i be denoted $y_i$, where

$$y_i = \begin{cases} 1 & \text{if something is TRUE,} \\ 0 & \text{if something is FALSE.} \end{cases}$$

These sorts of variables are called binary variables. Some examples of important binary variables in the social sciences are whether or not a person joins a labor union, whether someone is married or single, whether or not someone is diagnosed with an illness, whether a country is a democracy or not, and whether a voter casts a vote for the incumbent or the challenger in an election. One way to model a binary variable like this one is with the Bernoulli distribution, which is given by the following formula:

$$f(y_i|p) = p^{y_i}(1-p)^{1-y_i}.$$

Here, p expresses the probability that $y_i = 1$. Our goal is to calculate the most likely value of p given our observations of $y_i$. To make an inference about the entire population instead of just one individual, we multiply the formulas for every individual together:

$$L(p|y_1,\ldots,y_N) = \prod_{i=1}^{N} p^{y_i}(1-p)^{1-y_i}.$$
This function is another example of a likelihood function (introduced in Exercise 10 in Chapter 1). As discussed in that exercise, social scientists usually prefer to work with a log-likelihood function, the natural logarithm of the likelihood function, instead of with a likelihood function.
(a) Demonstrate that the log-likelihood function is equal to

$$\ell = \ln(p)\sum_{i=1}^{N} y_i + \ln(1-p)\left(N - \sum_{i=1}^{N} y_i\right).$$

(Hint: You will need to use the rules of logarithms, summations, and long products described in Chapter 1.)
(b) Find the derivative of the log-likelihood function with respect to p. (Hint: When we say “with respect to,” we are indicating the term that should be treated as the variable. Treat all other variables, every y term in this case, as if they are constant.)
(c) Set the derivative you found in (b) equal to 0, and solve for p. As will be discussed in Chapter 5, your solution is a formula for the most likely value of p given the data.⁵ In substantive terms, discuss why the formula you just derived for p makes sense.
(d) A very prevalent method used in the social sciences is logistic regression, also known as a logit model. A logit model involves the same math that you’ve completed in (a) through (c), but it has an additional step. Instead of simply solving for p, we substitute it with a function that has an independent variable x. In this way, we allow independent variables to have an effect on the binary dependent variable y. Suppose that the outcome we are trying to explain, $y_i$, is whether a voter votes for the incumbent ($y_i = 1$) or for the challenger ($y_i = 0$). Let $x_i$ represent the political ideology of individual i as a 7-point scale from −3 to 3, where −3 means very liberal, 0 means moderate, and 3 means very conservative. One example of a logit model replaces p (in
5 In statistics, this formula is what is known as a maximum likelihood estimate of the parameter p.
this case, the probability of voting for the incumbent) with the following function: pi =
. + e−( . + . xi ) Find the probabilities that a very liberal individual votes for the incumbent, a moderate individual votes for the incumbent, and a very conservative individual votes for the incumbent. (e) An important way to interpret the results of a logit model is to calculate the marginal change in probability, which is defined as the derivative of pi with respect to a particular x. Find a formula for dpi /dxi . Next, plug in xi = (a moderate voter), and report the value of dpi /dxi . (f) Provide a substantive description of your result in (e). What would it have meant if (dpi /dxi ) = when we plug in xi = ?
5 Optimization
Consider the function that is graphed in Figure 5.1. The derivative, or slope, of the function at points A and D is positive, so as x increases f(x) will increase. As we consider points closer and closer to points B and C, the slope is getting closer and closer to 0, and at points B and C the derivative is exactly equal to 0.

Figure 5.1 A Graph With a Slope of 0 at Points B and C
[Figure: graph of f(x) against x with points A, B, C, and D marked]

But notice that points B and C are special points on this graph. B is a peak on the graph, and C is the bottom of a valley. So the derivative is telling us about these special features of the graph: Specifically, the derivative is 0 at points that are peaks or valleys on the graph. These peaks and valleys are substantively very important for social science research because they tell us about the maximum and minimum values that f(x) takes on.

Suppose that while reading a particularly frustrating article you throw your highlighter in the air. Imagine a function where the height of the highlighter depends on how long it has been in the air. For a while, the highlighter is going up. At these points in time, the derivative of the height function is positive: As time goes by, the highlighter is getting higher. After a while the highlighter falls back down. At these points, the derivative is negative: The height of the highlighter decreases as time goes by. But there is one instant where the highlighter stops getting higher and hasn’t yet started getting lower. At this point, the derivative is 0, and the highlighter has reached its maximum height.
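The highlighter story can be made concrete with a small sketch. It assumes a simple hypothetical height function, h(t) = h₀ + v₀t − ½gt², with made-up values for the release height and the upward speed; none of these numbers come from the text.

```python
# Hypothetical values: release height (m), upward speed (m/s), gravity (m/s^2).
H0 = 1.0
V0 = 4.9
G = 9.8


def height(t):
    """Height of the highlighter t seconds after it is thrown."""
    return H0 + V0 * t - 0.5 * G * t ** 2


def height_derivative(t):
    # Derivative of the height function with respect to time.
    return V0 - G * t


# Setting the derivative equal to 0 and solving for t gives the peak time.
peak_time = V0 / G

if __name__ == "__main__":
    print(height_derivative(0.2))  # positive: still rising
    print(height_derivative(0.8))  # negative: falling
    print(peak_time, height(peak_time))
```

Before the peak time the derivative is positive, after it the derivative is negative, and at the peak itself the derivative is 0, exactly as described above.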
There are many functions in the social sciences for which finding these maximum and minimum points is very important. First, researchers sometimes model a dynamic phenomenon such as public opinion. They will use the data at hand to create a function to predict future values of public opinion. Finding the maximum and minimum points of this function will provide a prediction for when some measure of public opinion, presidential approval, for example, will be at its highest and lowest values. In game theory, actors in political and economic situations take actions to achieve the best outcome. Each actor has a utility function over the set of possible actions. Given an action, an actor has a utility, and we say that actors try to choose the action that maximizes their utility. Here, we maximize the utility function. Most statistical models used in social science research use maximum likelihood estimation (MLE) methods. MLE chooses the values of the parameters such that the likelihood that the data we observe should have occurred is maximized. Here, we maximize a likelihood function. Finding these maximum and minimum points on a graph is a process called optimization. Optimization is the most important application of derivatives to social science research.
5.1 Terminology

To maximize or minimize a function, the derivative of the function must be 0. In other words, for a differentiable function f, we require that $f'(x) = 0$ at a peak or a valley. The values of x that make the derivative equal to 0 are called the critical points of the function. The peaks and valleys, the maxima and minima, can occur only where x is a critical point. Let $x_c$ be a critical point of f. Then the value of the function at the critical point, $f(x_c)$, is the height of the peak or valley. We say that $f(x_c)$ is a relative extremum or a local extremum, an extreme value of the function in a localized area of the graph. There are three kinds of relative extrema:

1. Relative (local) maxima. A relative extremum is a local maximum if it is a peak on the graph. More specifically, a local maximum is the highest point in a localized area of the graph.

2. Relative (local) minima. A relative extremum is a local minimum if it is at the bottom of a valley on the graph. A local minimum is the lowest point in a localized area of the graph. The graph in Figure 5.2 has both a relative maximum and a relative minimum.

3. Saddle points. A saddle point is neither a peak nor a valley, so it is neither a local maximum nor a local minimum. As illustrated in Figure 5.3, saddle points occur when a function with a positive slope flattens out until the slope equals 0 (at a critical point) but then increases and has a positive slope again. Likewise, a saddle point will occur if a graph has a negative slope, increases to a slope of 0, and then decreases to a negative slope again.

Figure 5.2 A Graph With a Relative Maximum at $f(x_1)$ and a Relative Minimum at $f(x_2)$
[Figure: graph of f(x) with critical points $x_1$ and $x_2$ marked]

Figure 5.3 A Graph With a Saddle Point at f(x)
[Figure: graph of f(x) with a saddle point at a critical point]

At a saddle point, the first derivative is equal to 0 and the second derivative is also equal to 0. Consider what the second derivative describes: It is the rate of change of the slope of the function. If the slope is increasing, the second derivative is positive. If the slope is decreasing, the second derivative is negative. At a saddle point, the slope is increasing to one side of the point and decreasing on the other side of the point. At the saddle point, the slope is neither increasing nor decreasing, so the second derivative is 0. Whenever x is a critical point and $f''(x) \neq 0$, then the relative extremum is a local maximum or minimum.

It is misleading to classify a saddle point as a local extremum, because even locally a saddle point is never the highest or the lowest point on the graph. Just keep in mind that critical points can sometimes produce these saddle points instead of a peak or a valley.
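The role of the second derivative at a critical point can be sketched as a small classifier. The three test functions below (x², −x², and x³, each with a critical point at 0) are illustrative assumptions. When the second derivative is 0, the sketch reports only a possible saddle point, since that value alone does not settle the question.

```python
def classify_critical_point(second_derivative_value, tol=1e-9):
    """Classify a critical point from the sign of the second derivative there."""
    if second_derivative_value > tol:
        # Slope is increasing through 0: bottom of a valley.
        return "local minimum"
    if second_derivative_value < -tol:
        # Slope is decreasing through 0: top of a peak.
        return "local maximum"
    # Second derivative is 0: the sign test cannot decide on its own.
    return "inconclusive (possible saddle point)"


# Hand-computed second derivatives at the critical point x = 0.
EXAMPLES = {
    "x^2": 2.0,    # f''(x) = 2
    "-x^2": -2.0,  # f''(x) = -2
    "x^3": 0.0,    # f''(x) = 6x, which is 0 at x = 0
}

if __name__ == "__main__":
    for name, value in EXAMPLES.items():
        print(name, "->", classify_critical_point(value))
```

For x³ the slope is positive on both sides of 0 and flattens only at 0, which is the saddle-point shape described in the text.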
Relative extrema (which are not saddle points) are only guaranteed to be the highest or lowest values in a specific region of the graph. In contrast, absolute or global extrema are the highest or lowest values on the entire graph. If a function is differentiable and unrestricted in its domain, the global extrema, if they exist, must be local extrema somewhere on the graph. In other words, the global maximum is the highest peak of the graph, and the global minimum is the lowest valley of the graph.

If the function is not differentiable at some point, the graph may have a jagged point with no defined slope and no critical point. But this jagged point may be more extreme than any of the local extrema.

If the graph has a closed, bounded domain, that is, x is restricted to be between certain values including the boundaries, then the boundary points may be higher or lower than any of the local extrema. At the boundary points, the slope does not necessarily equal 0. We look for critical points because they are the points at which the graph is as high or as low as it will go before it turns in the other direction. But at the boundaries, the graph may appear as if it will continue to go higher or lower, but it cannot since the domain does not include values beyond that point. Figure 5.4 illustrates how a function can have global extrema at the boundaries if the domain is bounded.¹

Figure 5.4 A Function With Bounded Domain [−4, 4] Where the Global Extrema Are on the Boundaries
[Figure: graph of f(x) against x on [−4, 4], marking the global maximum, local maximum, local minimum, and global minimum]
1 If the domain is an open set (see Section 2.2) and the function would take its most extreme value at a boundary point, then the global extremum does not exist. Suppose the maximum of a function would occur at x = b, but the domain is restricted to the open interval (a, b). Then the function comes infinitesimally close to its value at x = b, without ever reaching that point.
For example, Figure 5.4 contains a graph with a local maximum and a local minimum. But this graph is bounded so that x ∈ [−4, 4]. The points x = −4 and x = 4 are not critical points, but since these are boundary points they may represent global extrema. Indeed, f(−4) is the global minimum, and f(4) is the global maximum. When checking for the global extrema, be sure to compare all of the local extrema and also any nondifferentiable or boundary points.
5.2 Finding Maxima and Minima

To find the extreme points of a differentiable function, do the following:

1. Take the derivative.
2. Set the derivative equal to 0, and solve for x. These values of x are the critical points.
3. Use either the first derivative test or the second derivative test, described below, to see whether a local maximum, local minimum, or saddle point exists at each critical point.
4. If the domain of the function is bounded or if the function is not smooth for some points in the domain, do the following:
   • Discard any critical points that exist outside the domain.
   • Compare the largest local maximum inside the domain with the value of the function at the boundary points and the nonsmooth points. The largest of these values is the global maximum.
   • Likewise, compare the smallest local minimum inside the domain with the value of the function at the boundaries and the nonsmooth points. The smallest of these values is the global minimum.

Example 1. Find the critical points of f(x) = x − x + x + , where x ∈ [ , ].

First, we take the derivative of f(x).

f ′(x) = x − x + .

Next, we set the derivative equal to 0 and solve for x. The values of x are the critical points.

f ′(x) = 0,
x − x + = 0,
(x − x + ) = 0,
(x − )(x − ) = 0.

So x = and x = are the critical points. We can keep both answers since they both exist within the stated domain x ∈ [ , ].
After performing Steps 1 and 2, you have identified the critical points of the function. Each critical point corresponds to a relative maximum, a relative minimum, or a saddle point. To determine which of these exists at a critical point, Step 3 directs you to use either the first or the second derivative test. Either of these tests will work, so which one you use is simply a matter of preference.

Theorem: The first derivative test
If c is a critical point of f(x), and c ∈ [a,b] where there are no other critical points in [a,b], then
1. if f ′(a) < 0 and f ′(b) > 0, then f(c) is a relative minimum;
2. if f ′(a) > 0 and f ′(b) < 0, then f(c) is a relative maximum; and
3. if f ′(a) and f ′(b) are either both positive or both negative, then f(c) is a saddle point.
In other words, take a point just to the left and just to the right of each critical point, and make sure that there are no other critical points between these points. Plug the two points into the first derivative. If the derivative goes from positive to negative, the critical point produces a maximum since the shape of the graph resembles an arch. If the derivative goes from negative to positive, the critical point produces a minimum. Here the shape of the graph resembles a smile. If the derivative does not change signs, the critical point produces a saddle point.

Let f be a differentiable function, x_left be a point just to the left of a critical point, and x_right be a point just to the right of a critical point. The possible results for the first derivative test are described in Table 5.1.

Table 5.1 The First Derivative Test

                        f ′(x_right) positive    f ′(x_right) negative
f ′(x_left) positive    Saddle point             Maximum
f ′(x_left) negative    Minimum                  Saddle point
Example 2. Use the first derivative test to see whether the critical points of f(x) = x − x + x + , where x ∈ [ , ], are relative maxima, relative minima, or saddle points.

Let's use x = and x = as the left and right points, respectively, for x = , and x = and x = for x = . Evaluating the derivative at these points, we find the following:

f ′( ) = ( ) − ( ) + = ,
f ′( ) = ( ) − ( ) + = − ,
f ′( ) = ( ) − ( ) + = − , and
f ′( ) = ( ) − ( ) + = .

The derivative goes from positive to negative for x = and from negative to positive for x = . Therefore, by the first derivative test, f( ) is a local maximum and f( ) is a local minimum.
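The first derivative test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function f(x) = x³ − 3x, whose critical points are x = −1 and x = 1, is a hypothetical stand-in rather than the function from the examples, and the helper name first_derivative_test and the step size h are choices made for this sketch.

```python
def f_prime(x):
    # Derivative of the hypothetical function f(x) = x**3 - 3*x.
    return 3 * x**2 - 3


def first_derivative_test(f_prime, c, h=0.5):
    """Classify the critical point c from the sign of f' at c - h and c + h.

    Assumes that c is the only critical point in [c - h, c + h].
    """
    left = f_prime(c - h)
    right = f_prime(c + h)
    if left > 0 and right < 0:
        return "maximum"       # slope goes from positive to negative: an arch
    if left < 0 and right > 0:
        return "minimum"       # slope goes from negative to positive: a smile
    return "saddle point"      # the slope does not change sign


for c in (-1.0, 1.0):
    print(c, first_derivative_test(f_prime, c))
```

The step size h plays the role of "just to the left and just to the right" and has to be small enough that no other critical point falls between the two test points.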
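An alternative test is the second derivative test.

Theorem: The second derivative test
If c is a critical point of f(x), then
1. if f ′′(c) < 0, then f(c) is a relative maximum;
2. if f ′′(c) > 0, then f(c) is a relative minimum; and
3. if f ′′(c) = 0, then the test is inconclusive: f(c) is often a saddle point, but it can also be a relative maximum or minimum (for example, f(x) = x⁴ has a minimum at x = 0 even though f ′′(0) = 0), so use the first derivative test to decide.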
In other words, take the second derivative of the function, and plug in each critical point. Local maxima form arches because the slope is decreasing, so the second derivative is negative. At a local minimum, the slope is increasing, so the second derivative is positive. As discussed earlier, the second derivative is 0 at a saddle point.

Example 3. Use the second derivative test to see whether the critical points of f(x) = x − x + x + , where x ∈ [ , ], are relative maxima, relative minima, or saddle points.

The second derivative is

f ′′(x) = d/dx ( x − x + ) = x − .

Plugging in the critical points, we get

f ′′( ) = ( ) − = − , and
f ′′( ) = ( ) − = .

So, according to the second derivative test, f( ) is a local maximum and f( ) is a local minimum.
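A minimal Python sketch of the second derivative test follows, again using the hypothetical function f(x) = x³ − 3x (critical points x = −1 and x = 1) rather than the function from the examples. The name second_derivative_test is an assumption of this sketch. When f ′′(c) = 0 the sketch reports "inconclusive" rather than "saddle point", because a second derivative of 0 does not by itself rule out an extremum: f(x) = x⁴ has a minimum at x = 0, where its second derivative is 0.

```python
def f_double_prime(x):
    # Second derivative of the hypothetical function f(x) = x**3 - 3*x.
    return 6 * x


def second_derivative_test(f_double_prime, c):
    """Classify the critical point c from the sign of f'' at c."""
    curvature = f_double_prime(c)
    if curvature < 0:
        return "maximum"        # slope is decreasing: an arch
    if curvature > 0:
        return "minimum"        # slope is increasing: a smile
    return "inconclusive"       # fall back on the first derivative test


for c in (-1.0, 1.0):
    print(c, second_derivative_test(f_double_prime, c))
```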
Finally, Step 4 tells you to compare the local extrema with each other and also with the boundaries if the domain is restricted. Suppose that a function f has the bounded domain [a,b] and has three critical points, called c_1, c_2, and c_3. Then make a list of the values f(a), f(c_1), f(c_2), f(c_3), and f(b). The largest value in this list is the global maximum, and the smallest value is the global minimum.

Example 4. Find the global maximum and the global minimum of f(x) = x − x + x + , where x ∈ [ , ].

The local extrema are f( ) and f( ). The boundary points are f( ) and f( ). Comparing all of these points, we see that

f( ) = ( ) − ( ) + ( ) + = ,
f( ) = ( ) − ( ) + ( ) + = ,
f( ) = ( ) − ( ) + ( ) + = − , and
f( ) = ( ) − ( ) + ( ) + = .

So f( ) = − is the global minimum, and f( ) = is the global maximum.
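Step 4 amounts to building a list of candidate points and comparing the function's values at those points. The Python sketch below illustrates the idea under stated assumptions: f(x) = x³ − 3x on the domain [−3, 3] is a hypothetical example, not the function used above, and global_extrema is a name invented for this sketch. The critical points are supplied by the caller, and nonsmooth points are not handled.

```python
def f(x):
    # Hypothetical function with critical points at x = -1 and x = 1.
    return x**3 - 3 * x


def global_extrema(f, critical_points, a, b):
    """Return ((x_min, f(x_min)), (x_max, f(x_max))) on the closed domain [a, b]."""
    # Discard critical points outside the domain, then add the two boundaries.
    candidates = [a, b] + [c for c in critical_points if a <= c <= b]
    x_min = min(candidates, key=f)
    x_max = max(candidates, key=f)
    return (x_min, f(x_min)), (x_max, f(x_max))


minimum, maximum = global_extrema(f, [-1, 1], -3, 3)
print("global minimum:", minimum)
print("global maximum:", maximum)
```

For this hypothetical function both global extrema fall on the boundaries, as in Figure 5.4. On a narrower domain such as [−2, 0.5], the critical point x = 1 is discarded and the local maximum at x = −1 becomes the global maximum.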
5.3 The Newton-Raphson Method

Sometimes, setting a function equal to 0 and solving isn't so easy. When an equation can be solved algebraically, we say that there exists an analytic solution to that equation. Much of the time in the social sciences, we are dealing with highly complex equations that have no analytic solution. If there is a solution but no analytic solution, we have to employ methods other than algebra to find it. One common method is to use a convergence algorithm. An algorithm is a set of rules that should be looped through again and again until a particular criterion for termination is met. Rather than deriving an algebraic formula to represent a solution, convergence algorithms employ an "educated" form of guess-and-check.

For example, to find the values of x that make a function f(x) equal to 0, the algorithm chooses some arbitrary value of x, plugs it in, and sees how close f(x) is to zero. If f(x) is close enough to 0 to satisfy the algorithm, it stops and reports its current value of x as the answer. If f(x) differs too much from 0, the algorithm uses some rule to choose a new value of x and tries again. Each try is called an iteration. When the algorithm chooses a satisfactory value for x, we say that the algorithm has converged. The following output from the statistical program Stata shows us how some statistical estimators use convergence algorithms:

    logit y x1

    Iteration 0:   log likelihood = -68.994376
    Iteration 1:   log likelihood = -51.12034
    Iteration 2:   log likelihood = -49.831461
    Iteration 3:   log likelihood = -49.783434
    Iteration 4:   log likelihood = -49.783336

    Logistic regression                         Number of obs =  100.00
                                                LR chi2(1)    =   38.42
                                                Prob > chi2   =  0.0000
    Log likelihood = -49.783336                 Pseudo R2     =  0.2784
    ------------------------------------------------------------------
         y |     Coef.  Std. Err.      z  P>|z|   [95% Conf. Interval]
    ------------------------------------------------------------------
        x1 |  4.110327  .8614319   4.77  0.000    2.421952   5.798703
     _cons | -4.284086  .9068329  -4.72  0.000   -6.061445  -2.506726
    ------------------------------------------------------------------
Logistic regression is a maximum likelihood technique, which means that we maximize a likelihood function for the data.² Solving the derivative of the likelihood function for 0, however, is a complicated task, so it is easier to use a convergence algorithm to find the coefficient estimates. The iterations at the top of the output are the iterations of the convergence algorithm. The values of log likelihood are the values of the function as the derivative gets closer to 0. After four iterations, the derivative is close enough to 0 to satisfy the algorithm, so the program reports the value of log likelihood at that point and the coefficient estimates that maximize log likelihood.

2 The higher the likelihood function, the more likely it is that the data we observe are the norm rather than the exception. A likelihood function gives probabilities, which are between 0 and 1, so when we take the natural logarithm, the values become negative. We use a logarithm because it breaks multiplicative terms into additive terms, but we then have to deal with negative numbers.

The Newton-Raphson algorithm is a method to approximate the roots of a function, that is, the values of x that make f(x) = 0. A second version of the Newton-Raphson algorithm, described below, approximates the critical points of a function. For the algorithm to work, the function must be differentiable, and the user of the algorithm must specify some initial value for x, called x_0. The algorithm follows these rules:

Algorithm: The Newton-Raphson method for finding the roots of a function, f(x) = 0.
1. Choose an initial value x_0.
2. Find a new value of x using the following formula:

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.$$

3. In general, find subsequent new values of x using

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}.$$

4. If f(x_n) is closer to 0 than some prespecified tolerance level, stop. If not, return to Step 3.
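The four steps above can be sketched in Python as follows. This is a minimal illustration, not the routine that Stata or any other package uses: the function name newton_raphson, the default tolerance, and the iteration cap max_iter are assumptions of the sketch, and the test function g(x) = x² − 2 (with roots at the positive and negative square roots of 2) is a hypothetical example.

```python
def newton_raphson(f, f_prime, x0, tol=1e-10, max_iter=100):
    """Approximate a root of f, starting from the initial value x0.

    Returns the approximate root and the number of iterations used.
    The update divides by f_prime(x), so it fails if the slope is exactly 0.
    """
    x = x0
    for iteration in range(max_iter):
        if abs(f(x)) < tol:              # Step 4: close enough to 0, so stop
            return x, iteration
        x = x - f(x) / f_prime(x)        # Steps 2 and 3: the update rule
    raise RuntimeError("no convergence within max_iter iterations")


def g(x):
    # Hypothetical function with two roots.
    return x**2 - 2


def g_prime(x):
    return 2 * x


root, iterations = newton_raphson(g, g_prime, x0=1.0)
print(root, iterations)
```

Starting from x0 = −1.0 instead leads the same code to the negative root, which shows how the choice of starting value decides which root the algorithm finds.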
The algorithm will start by looking at x_0, and if f(x) is not close enough to 0, the algorithm will choose new values of x based on the initial specification of x_0. If x_0 is specified to be very close to the true root, then the algorithm will converge quickly. But if x_0 is far away from the true root, then the algorithm might take many more iterations to converge. So, sometimes, the choice of x_0 is very important. The other consideration in choosing x_0 is that sometimes a function has multiple roots. The algorithm will converge on only one root, which may not be the root that you, as a researcher, are interested in. Choosing a value of x_0 close to the correct root will make it more likely (but it is not guaranteed) that the algorithm will converge on the root you want. If you have a theoretical expectation for where the root ought to be, the theory can inform the selection of x_0.

The root is on the x-axis, so when f(x_n) is positive, we need f ′(x_n) to be negative to go down, and when f(x_n) is negative, we need f ′(x_n) to be positive to go up. If the derivative is not in the direction we want, we need to take x backward to move in the correct direction. Therefore, when the function and the derivative are of different signs, the algorithm adds to x. When the function and the derivative are of the same sign, the algorithm subtracts from x. The behavior of the algorithm is summarized in Table 5.2.

Table 5.2 Behavior of the Newton-Raphson Algorithm

f(x_n)      f ′(x_n)    Update
positive    positive    Subtract from x_n to get x_{n+1}
positive    negative    Add to x_n to get x_{n+1}
negative    positive    Add to x_n to get x_{n+1}
negative    negative    Subtract from x_n to get x_{n+1}

Also, notice that as we approach the root, the function is getting closer to 0, so we are adding or subtracting less and less to the previous values of x to get new values of x. In other words, as we get close to the root, we take smaller steps. Stata uses a much more sophisticated version of a convergence algorithm that can handle multiple variables. But knowing how to perform the algorithm in the univariate case is important for understanding convergence, in general.

To find a critical point, we are trying to find the value of x such that f ′(x) = 0. So we plug f ′ in for f in the Newton-Raphson algorithm,

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)},$$

and follow the same steps outlined above:

Algorithm: The Newton-Raphson method for finding the critical points of a function, f ′(x) = 0.
1. Choose an initial value x_0.
2. Find a new value of x using the following formula:

$$x_1 = x_0 - \frac{f'(x_0)}{f''(x_0)}.$$

3. In general, find subsequent new values of x using

$$x_{n+1} = x_n - \frac{f'(x_n)}{f''(x_n)}.$$

4. If f ′(x_n) is closer to 0 than some prespecified tolerance level, stop. If not, return to Step 3.
Example 1. Use the Newton-Raphson algorithm to find the positive critical point of f(x) = x + x − x + . We can use the techniques described in the previous section to prove that the critical points of this function are x = and x = − . We want to use the Newton-Raphson algorithm to find the positive critical point, so let's start with x_0 = . We will have to evaluate the derivative, f ′(x) = x + x − , and the second derivative, f ′′(x) = x + , for each new value of x.

• Iteration 0: x_0 = .
• Iteration 1: x_1 = x_0 − f ′(x_0)/f ′′(x_0) = − [ ( ) + ( ) − ] / [ ( ) + ] = . .
• Iteration 2: x_2 = x_1 − f ′(x_1)/f ′′(x_1) = . − [ ( . ) + ( . ) − ] / [ ( . ) + ] = . .
• Iteration 3: x_3 = x_2 − f ′(x_2)/f ′′(x_2) = . − [ ( . ) + ( . ) − ] / [ ( . ) + ] = . .
• Iteration 4: x_4 = x_3 − f ′(x_3)/f ′′(x_3) = . − [ ( . ) + ( . ) − ] / [ ( . ) + ] = . .

We indeed are converging on the critical point x = . Note, however, that had we chosen a starting value closer to x = − , we might have ended up converging on this critical point instead.
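The critical-point version of the algorithm can be sketched in Python in the same way. The function below is hypothetical (f(x) = x³ − 3x, with critical points at x = 1 and x = −1) and is not the function from Example 1. The name newton_critical_point, the tolerance, and the iteration cap are likewise assumptions made for this sketch.

```python
def newton_critical_point(f_prime, f_double_prime, x0, tol=1e-8, max_iter=50):
    """Approximate a critical point by applying Newton-Raphson to f'.

    Returns the approximation and the list of every x visited.
    """
    x = x0
    history = [x]
    for _ in range(max_iter):
        if abs(f_prime(x)) < tol:                 # Step 4: f'(x) is close enough to 0
            return x, history
        x = x - f_prime(x) / f_double_prime(x)    # Steps 2 and 3
        history.append(x)
    raise RuntimeError("no convergence within max_iter iterations")


def f_prime(x):
    # Derivative of the hypothetical function f(x) = x**3 - 3*x.
    return 3 * x**2 - 3


def f_double_prime(x):
    return 6 * x


critical_point, history = newton_critical_point(f_prime, f_double_prime, x0=2.0)
for n, x in enumerate(history):
    print("Iteration", n, "x =", x)
```

Printing the history shows the steps shrinking as the algorithm approaches x = 1. Starting at x0 = −2.0 converges on the other critical point, x = −1, and starting exactly at x0 = 0 fails because f ′′(0) = 0 for this function.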
Exercises

1. Find the global minimum and global maximum of the following functions over the stated domains:
   (a) f(x) = x − x − x , where x ∈ [− , ].
   (b) g(x) = x ln(x) − x, where x ∈ ( , ].

2. Consider the function f(x) = x − x + x + on the domain x ∈ [ , ].
   (a) Find the critical points, global minimum, and global maximum of f(x).
   (b) Using a starting value x_0 = , perform the Newton-Raphson algorithm to find the critical points of the function f(x) = x − x + x + . Write down every iteration, and keep going until you are convinced that the algorithm is converging on a specific value. What is this value? Now use x_0 = as a starting point, and repeat the Newton-Raphson algorithm.
   (c) What are two problems the Newton-Raphson algorithm encounters in trying to maximize f(x)?

3. In 1995, and again in 2013, the federal government shut down over an impasse in budget negotiations between the Republicans in the House of Representatives and the Democratic president. To some extent, the Republican members of the House of Representatives benefitted from the government shutdown. The shutdown allowed these members to champion policy positions that were very popular in their constituencies, and the shutdown may have resulted in a favorable policy outcome for Republicans. As the shutdown continued, however, there were fewer new opportunities for Republican members to take positions, and there was less of a chance that a deal would favor Republican policy preferences. At the same time, House Republicans were facing mounting electoral costs as the government remained closed.³

   Suppose that we have created a mathematical model of the costs and benefits the House Republicans derived from the shutdown. Let t be the number of days since the shutdown began. A function that describes the benefits the House Republicans derived from the shutdown after t days is

   b(t) = ln(t + ),

   and a function that describes the costs incurred by House Republicans after t days of government shutdown is

   c(t) = t .

   Here is a graph of these two functions:

   [Graph: the Benefits and Costs curves, with Costs/Benefits (0 to 150) on the vertical axis and Days (0 to 40) on the horizontal axis.]

   Suppose that the overall utility (U) that the House Republicans derive from the shutdown is equal to the benefits minus the costs:

   U(t) = ln(t + ) − t .

   How long should the shutdown last for the House Republicans to maximize their utility? Also, demonstrate that your answer indeed represents the global maximum. (Hint: Remember the quadratic formula?)
3 See, for example, Dutton, S., De Pinto, J., Salvanto, A., & Backus, F. (2013, October 3). Poll: Americans not happy about shutdown; more blame GOP. CBS News. Retrieved from http://www.cbsnews.com/news/poll-americans-nothappy-about-shutdown-more-blame-gop/.
4. The simplest possible linear regression model (see Section 2.5.3) has just one independent variable, and no y-intercept term. It has this formula:

$$y_i = \beta x_i + \epsilon_i.$$

   In this model, y_i is the dependent variable, and x_i is the independent variable. The subscript i refers to an individual observation. Here, we are saying that a value of y_i is equal on average to β times the value of x_i. The difference between y_i and βx_i is called the error, and is denoted ϵ_i. Our goal is to find the value of β that provides the best fit for the observed data by making the errors as small as possible. Specifically, we want to minimize the sum of squared errors.⁴ The formula for a squared error is

$$\epsilon_i^2 = (y_i - \beta x_i)^2 = (y_i^2 - 2\beta x_i y_i + \beta^2 x_i^2).$$

   The sum of squared errors is then given by

$$\sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} (y_i^2 - 2\beta x_i y_i + \beta^2 x_i^2).$$

   Find a formula for the value of β that minimizes the sum of squared errors. Also, demonstrate that this value of β really describes a local minimum. Some hints and points to remember:
   • Since you are solving for β, you should take the derivative of the sum of squared errors with respect to β.
   • Treat x_i and y_i as constants since these are observed data points whose values are known.
   • Since a derivative can be broken up over addition, you can bring a derivative inside a summation: $\frac{d}{d\beta} \sum f(\beta) = \sum \frac{d}{d\beta} f(\beta)$.
   • Since x_i and y_i have a subscript i, they must stay inside the summation. But β can be brought outside the summation.

5. We've collected data on a variable x from a random sample of the population of the United States. We believe that the population is normally distributed with a standard deviation of 1, but we don't know the mean of x in the population. In this problem, we're going to use maximum likelihood statistics to estimate the population mean. Each data point x_i is assumed to be independent and normally distributed with standard deviation = 1. The mean is not known, so we denote it as μ. That means that each x_i has the probability density function

$$f(x_i) = \frac{1}{\sqrt{2\pi}}\, e^{-.5(x_i - \mu)^2}.$$
4 We consider a sum because we want to create a model that performs well for all of the observed data, and we square the errors so that positive and negative errors do not cancel out, and so that large errors are penalized more than small errors.
   The likelihood function (L) is the joint probability distribution of all the data points. Since they are independent, we can multiply all of the distributions together:

$$L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-.5(x_i - \mu)^2}.$$

   We've gone out and measured x_1, x_2, . . . , x_n, so we can treat those values as known constants. The only variable here is μ. We want to choose the estimate of μ that maximizes the likelihood function, because that way it's most likely that the data we observed are the data we ought to have observed. It is almost always easier to maximize the natural logarithm of the likelihood function than the likelihood function itself, and this transformation does not change the maximum. The natural logarithm of the likelihood function is equal to

$$\ell(\mu) = \ln\left( \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-.5(x_i - \mu)^2} \right).$$

   In Exercise 10 from Chapter 1, you showed that this log-likelihood function can be rewritten as

$$\ell(\mu) = \frac{-n}{2} \ln(2\pi) - .5 \sum_{i=1}^{n} (x_i - \mu)^2.$$

   Your task in this problem is to find the value of μ that maximizes this log-likelihood function. Your answer will be our best guess as to the true value of μ in the population, given the data.
   (a) Find dℓ/dμ.
   (b) Set the derivative equal to 0, and solve for μ.
   (c) Use the second derivative test to show that this critical value of μ refers to a maximum.
   (d) Look at the answer you derived for μ in (b). How would you describe this estimate, in plain English? Does this estimate make sense?
6 Integration

In Chapter 4, you learned how to find the derivatives of functions and how to interpret these derivatives. But differentiation is a mathematical operation, and every operation in mathematics has an inverse operation to reverse it. The inverse operation of differentiation is integration.
6.1 Informal Definitions of an Integral

You should now be able to explain that the derivative of the function f(x) = x² is f ′(x) = 2x, where the derivative expresses the slope of f(x) at any point x. These graphs are drawn in Figure 6.1. The point (.5, .25) exists on the graph of f(x) (in the left panel) since .5² = .25. Since 2x is the derivative of x², the point (.5, 1) exists on the graph of f ′(x) since the slope of the curve f(x) = x² at the point x = .5 is 1. We say that 2x is the derivative of x², but we can also say that x² is an antiderivative of 2x.¹ The formal definition of an integral is presented in Section 6.3, but one informal way to define an integral is as an antiderivative of a function.

Informal Definition: Integral (Version 1)
An integral is an antiderivative. Taking an integral cancels out taking a derivative. Integrals used in this way are called indefinite integrals.

But Figure 6.2 shows us that there's something else going on here. Define the "area under the curve" to be the area between the graph and the x-axis. We don't consider the region underneath the x-axis to count as positive area under the curve.² The area under the curve of f ′(x) = 2x is a triangle, so we can use the formula A = ½bh (one half the base, times height) to calculate the area between points 0 and x on the x-axis. To find the area under the curve of f ′(x) = 2x from 0 to 0.5, note that the base is of length 0.5 and the height is 1, so the area is 0.25. But the value of the original function, f(x) = x², at 0.5 is also 0.25. So the value of f(.5) is equal to the area under the curve of f ′(x) from 0 to 0.5. In the table below Figure 6.2, we see that this pattern continues for other values: f(1) is equal to the area under 2x from 0 to 1, f(2) is equal to the area under 2x from 0 to 2, … f(10) is equal to the area under 2x from 0 to 10, and so on.

1 Technically, all functions of the form x² + c, where c is a constant, are antiderivatives of 2x because the constant disappears in the derivative. But for the purposes of finding areas under curves, we drop the constant.

2 Later in this chapter, we'll say that curves that dip below the x-axis have negative area under the curve. That is, they have ground to make up before we can even count any positive area for these curves.
Figure 6.1 The Graph of f(x) = x² (Left) and Its Derivative f ′(x) = 2x (Right)
[Figure: left panel, f(x) against x with the point (.5, .25) marked and the note "Slope at x = .5 is 1"; right panel, f ′(x) against x with the point (.5, 1) marked.]

Figure 6.2 An Illustration That the Area Under the Curve Can Be Calculated With an Antiderivative
[Figure: left panel, f(x) against x with the point (.5, .25) marked; right panel, f ′(x) against x with the point (.5, 1) marked and the note "Area under the curve from 0 to .5 = .25".]

x       f(x) = x²    Area Under the Curve of 2x From 0 to x
0.5     0.25         ½(.5)(1) = .25
1       1            ½(1)(2) = 1
2       4            ½(2)(4) = 4
10      100          ½(10)(20) = 100
x       x²           ½(x)(2x) = x²
In general, the area under the curve of a function from the beginning of its domain to x is the same as the value of the antiderivative at x. Therefore, we can write the definition of the integral in a second way.
Informal Definition: Integral (Version 2) An integral measures the area under the curve of a function between two specified boundaries on the x-axis. Integrals used in this way are called definite integrals.
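The pattern behind this second definition can be checked numerically. The short Python sketch below (the function name is hypothetical) computes the triangular area under f ′(x) = 2x from 0 to x with the formula A = ½bh and compares it with the value of f(x) = x² for x = 0.5, 1, 2, and 10.

```python
def triangle_area_under_2x(x):
    """Area between the line 2x and the x-axis from 0 to x, using A = (1/2)bh."""
    base = x
    height = 2 * x
    return 0.5 * base * height


for x in (0.5, 1, 2, 10):
    # The area and the antiderivative x**2 agree at every value of x.
    print(x, triangle_area_under_2x(x), x**2)
```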
Calculus courses today focus on these two definitions of an integral separately. The reason for this separation is that for hundreds of years mathematicians were using antiderivatives and were interested in approximating areas under curves, but they were unaware that the two exercises were actually the same thing. When it was discovered that antiderivatives were the same as formulas to find areas, it was a vastly important discovery and did a lot to advance the development of modern mathematics and physics.
6.2 Riemann Sums

Consider for now the definition of an integral as a measurement of area under a curve. Suppose we have to find the area from x = 0 to x = 7 of the graph pictured in Figure 6.3.

Figure 6.3 A Function That Can Be Integrated
[Figure: graph of f(x) against x for x from 0 to 7.]

For many functions, there is a way to analytically solve for this area. That is, there is a way to work out the exact value of the area by actually solving the integral. I discuss methods for solving integrals later in this chapter. The area under the curve of any function, however, can be approximated by using a numerical method. Numerical methods are ways to approximate values that we cannot obtain by solving an equation or deriving a formula. Analytic solutions are best, but numerical solutions are viable alternatives when something cannot be solved analytically. Some functions are much too complicated to be neatly integrated, and since no analytic solution exists in these cases, we have to approximate the area under the curve with a numerical method. Computers use numerical methods all the time, and understanding when and how computers approximate important values will be important for understanding the applied statistics you will use in your research.

Integration is a general method for finding the area, but if you can remember all the way back to your course on geometry, you might recall that we already have formulas to find the areas of particular kinds of shapes. Specifically, remember that the area of a rectangle is the length of the rectangle times the width of the rectangle. A trapezoid is like a rectangle, but one of its two parallel sides is shorter than the other. The area of a trapezoid is the distance between the two parallel sides times the average of the lengths of the two parallel sides. If the shape in Figure 6.3 could be broken up into rectangular or trapezoidal regions, then we would not need any integration to find the area under the curve.

Unfortunately, the area in Figure 6.3 does not break down exactly into rectangular or trapezoidal parts. But observe that rectangles and trapezoids can nicely approximate this area. Figure 6.4 shows four ways to approximate the area under the curve. These kinds of integral approximation are called Riemann sums. Bernhard Riemann was a 19th-century German mathematician and is known as one of the greatest mathematicians of all time. His doctoral adviser was Carl Friedrich Gauss, whose work led to the development of modern physics and statistics. Riemann made a number of important contributions to number theory that help mathematicians understand how the prime numbers are distributed within the set of positive integers.³ Integral approximations through Riemann sums form the basis of Gaussian quadrature, which is the primary method used by computers to solve complicated integrals.

There are different ways to compute a Riemann sum.
But the general process is as follows:
1. Divide the domain (in this case x ∈ [0, 7]) into n equal parts. As shown in Figure 6.5, as the number of parts increases, the accuracy of the approximation also increases.
2. Draw a line from each partition point on the x-axis up to the curve.
3. Choose the shape and height of each area partition based on the rule you want to use:
   • For a left Riemann sum, draw rectangles using the left points as the height. This method is demonstrated in the top-left corner of Figure 6.4.
   • For a right Riemann sum, draw rectangles using the right points as the height, as shown in the top-right corner of Figure 6.4.
   • For a midpoint Riemann sum, draw rectangles using the height of the curve halfway between the two points as the height, as shown in the bottom-left corner of Figure 6.4.
   • For a trapezoidal Riemann sum, draw trapezoids by drawing a straight line between the two points on the curve, as shown in the bottom-right corner of Figure 6.4.

3 One of these contributions is the Riemann hypothesis, which provides a general formula for the distribution of prime numbers but is unproven. It has been demonstrated to be true for at least the first 10 trillion roots of the function that expresses the prime numbers. Computers will never be able to provide a proof, however. The Riemann hypothesis is one of the seven Millennium problems, for each of which the Clay Mathematics Institute has promised $1 million to the researcher who finds a proof. Only one of these problems, the Poincaré conjecture, has been solved. The other six remain among the most important unsolved problems in mathematics.
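The general process above can be sketched in Python. This is a minimal illustration under stated assumptions: the name riemann_sum and its rule argument are invented for the sketch, and the test function f(x) = x² on [0, 3], whose exact area is 9, is a hypothetical example rather than the curve in Figure 6.3.

```python
def riemann_sum(f, a, b, n, rule="left"):
    """Approximate the area under f on [a, b] using n equal partitions."""
    width = (b - a) / n                      # every partition has the same width
    total = 0.0
    for i in range(n):
        left = a + i * width                 # left edge of partition i
        right = left + width                 # right edge of partition i
        if rule == "left":
            height = f(left)
        elif rule == "right":
            height = f(right)
        elif rule == "midpoint":
            height = f((left + right) / 2)
        elif rule == "trapezoid":
            height = (f(left) + f(right)) / 2    # average of the two heights
        else:
            raise ValueError("unknown rule: " + rule)
        total += width * height
    return total


def f(x):
    # Hypothetical function; the exact area under it on [0, 3] is 9.
    return x**2


for rule in ("left", "right", "midpoint", "trapezoid"):
    print(rule, riemann_sum(f, 0, 3, 6, rule))
```

With six partitions the left and right rules miss the exact area by more than 2, while the midpoint and trapezoidal rules come within about 0.13. Increasing n shrinks the error for every rule.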
Figure 6.4 Various Kinds of Riemann Integral Approximation
[Figure: four panels titled Left, Right, Midpoint, and Trapezoid, each showing f(x) against x for x from 0 to 7 with the corresponding rectangles or trapezoids.]
Most of the time, the midpoint and trapezoidal rules provide better approximations than the left and right rules.

The formulas for the Riemann sums listed below may look complicated, but they only describe adding up the areas of rectangles or trapezoids. In the formulas below, let a be the left end point for the domain of x, and let b be the right end point, so that x ∈ [a,b]. Also let n equal the number of partitions and A equal the area. For a left Riemann sum,

$$A \approx \sum_{i=0}^{n-1} \frac{b-a}{n}\, f\left(a + \frac{(b-a)i}{n}\right).$$
Figure 6.5 The Quality of a Riemann Approximation Increases With the Number of Partitions
[Figure: two panels titled 3 Partitions and 20 Partitions, each showing f(x) against x for x from 0 to 7.]
The summation sign represents the fact that we are adding together the areas of many rectangles. Within the summation, ((b − a)/n) f(a + (b − a)i/n) is the area of one rectangle. (b − a)/n is the width of the rectangle. Each rectangle has the same width: the length of the domain (b − a) divided by the number of partitions n. So if the domain is [0, 7] (as in the graph in Figure 6.3) and we create seven partitions, then each rectangle has a width of (7 − 0)/7 = 1.

The rectangles have different heights, which depend both on the value of f(x) and on the rule you choose to use. The argument of the function, a + (b − a)i/n, describes the points where the top of each rectangle will be drawn. The i term refers to the index of the summation, which for a left Riemann sum begins at 0 and goes to n − 1. That is, the left Riemann sum draws the height of each rectangle at the left point of each consecutive pair of points. The first rectangle in the left Riemann sum in Figure 6.4 is drawn at f(0 + (7 − 0)(0)/7) = f(0). The top of the second rectangle is drawn at f(0 + (7 − 0)(1)/7) = f(1). Likewise, the other rectangles are drawn with heights f(2), f(3), f(4), f(5), and f(6), respectively. So in this example, the left Riemann sum is

1 × (f(0) + f(1) + f(2) + f(3) + f(4) + f(5) + f(6)).

The formula for a right Riemann sum,

$$A \approx \sum_{i=1}^{n} \frac{b-a}{n}\, f\left(a + \frac{(b-a)i}{n}\right),$$

is very similar to the formula for a left Riemann sum. The only difference is that the summation indexes i from 1 to n instead of from 0 to n − 1. This means that the right Riemann sum draws the height of each rectangle at the right point of each consecutive pair of points. The first rectangle in the right Riemann sum in Figure 6.4 is drawn at f(0 + (7 − 0)(1)/7) = f(1), and the other rectangles are drawn with heights f(2), f(3), f(4), f(5), f(6), and f(7), respectively. So in this example, the right Riemann sum is

1 × (f(1) + f(2) + f(3) + f(4) + f(5) + f(6) + f(7)).

The formula for a midpoint Riemann sum is

$$A \approx \sum_{i=1}^{n} \frac{b-a}{n}\, f\left(a + \frac{(b-a)(i - .5)}{n}\right),$$

which draws the height of each rectangle at the midpoint on the x-axis between each consecutive pair of points. The first rectangle in the midpoint Riemann sum in Figure 6.4 is drawn at f(0 + (7 − 0)(1 − .5)/7) = f(.5), and the total Riemann sum is

1 × (f(.5) + f(1.5) + f(2.5) + f(3.5) + f(4.5) + f(5.5) + f(6.5)).

The trapezoidal Riemann sum uses a slightly different formula:

$$A \approx \sum_{i=0}^{n-1} \frac{b-a}{n} \cdot \frac{f\left(a + \frac{(b-a)i}{n}\right) + f\left(a + \frac{(b-a)(i+1)}{n}\right)}{2}.$$

This formula uses the area of a trapezoid, the length of the base times the average of the heights, instead of the area of a rectangle. As with the rectangles, the length of the base, (b − a)/n, is the same for every trapezoid. The trapezoidal Riemann sum calculates the average of the height of the function at the right and left points of each consecutive pair of points. The first average in the trapezoidal Riemann sum in Figure 6.4 is [f(0) + f(1)]/2, and the total Riemann sum is

1 × ([f(0) + f(1)]/2 + [f(1) + f(2)]/2 + [f(2) + f(3)]/2 + [f(3) + f(4)]/2 + [f(4) + f(5)]/2 + [f(5) + f(6)]/2 + [f(6) + f(7)]/2)
= .5 × (f(0) + 2f(1) + 2f(2) + 2f(3) + 2f(4) + 2f(5) + 2f(6) + f(7)).

Example 1. Use left and trapezoidal Riemann sums with six partitions to approximate the area under the curve of f(x) = x − x + for x ∈ [ , ].
• Plugging in the values a = , b = , and n = , we can use the formula for a left Riemann sum: A≈
∑
( ) i f
i=
=
[ ] f( ) + f( . ) + f( ) + f( . ) + f( ) + f( . )
= . [ + . = .
.
+ + .
+ + .
]
• Using the formula for a trapezoidal Riemann sum, A≈
∑ i=
=
[( ) ( )] f i + f (i+ )
[
] f( ) + f( . ) + f( ) + f( . ) + f( ) + f( . ) + f( )
= . [ + ( . =
.
)+ ( )+ ( .
)+ ( )+ ( .
)+
]
.
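The four Riemann sum formulas in this section can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: the function name `riemann_sum` and the integrand `stand_in` (here x squared on [0, 3]) are assumptions chosen for the sketch and are not the function from Example 1.

```python
def riemann_sum(f, a, b, n, rule="left"):
    """Approximate the area under f on [a, b] using n equal-width partitions."""
    width = (b - a) / n
    if rule == "left":
        # Heights at the left point of each pair: i = 0, ..., n - 1.
        heights = [f(a + width * i) for i in range(n)]
    elif rule == "right":
        # Heights at the right point of each pair: i = 1, ..., n.
        heights = [f(a + width * i) for i in range(1, n + 1)]
    elif rule == "midpoint":
        # Heights halfway between each pair of points.
        heights = [f(a + width * (i - 0.5)) for i in range(1, n + 1)]
    elif rule == "trapezoid":
        # Average of the left and right heights of each pair.
        heights = [(f(a + width * i) + f(a + width * (i + 1))) / 2 for i in range(n)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    return width * sum(heights)


def stand_in(x):
    # Hypothetical integrand used only for illustration.
    return x ** 2


for rule in ("left", "right", "midpoint", "trapezoid"):
    print(rule, riemann_sum(stand_in, 0, 3, 6, rule))
```

For this stand-in integrand the exact area is 9, so the printed values show the midpoint and trapezoidal rules landing closer than the left and right rules.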
6.3 Integral Notation
Before we discuss how to solve integrals analytically, it will help a great deal to discuss the notation that you will encounter when working with integrals. Recall that for derivatives two sets of notation are used to respect the two academic traditions (Newton and Leibniz) that developed differential calculus. For integrals, fortunately, there is only one prevalent style of notation, and it comes directly from the discussion of Riemann sums in the previous section. The area under a curve f(x) from x = a to x = b is approximated by a Riemann sum. Take, for example, the right Riemann sum:
\[ A \approx \sum_{i=1}^{n} \frac{b-a}{n}\, f\left(a + \frac{(b-a)i}{n}\right). \]
As the number of partitions increases, this approximation becomes more accurate. So if there are infinitely many partitions, the Riemann sum should be exactly equal to the area under the curve:
\[ A = \lim_{n\to\infty} \sum_{i=1}^{n} \frac{b-a}{n}\, f\left(a + \frac{(b-a)i}{n}\right). \]
The above equation is the formal definition of an integral.⁴ A function is called integrable if this limit actually exists. The limit will not exist if there is any one partition with infinite (or negatively infinite) height, which happens when a function has vertical asymptotes. In practice, the parts of this limit are written in different ways to make integrals easier to recognize. If the number of partitions is infinitely large, then the distance between partition points is infinitely small. Infinitely small values are said to be infinitesimal. Infinitesimal numbers are greater than 0 but are smaller than any positive real number. The distance between partition points is the difference between a value of x and the previous value, which can be written in notation familiar to physicists as Δx. To express the fact that this distance is infinitesimal, however, we instead write dx.
⁴ The official mathematical definition of an integral is the expression that simultaneously equals the limit as the number of partitions approaches infinity of the upper Riemann sum (the sum that uses the larger height for each consecutive pair of partition points) and lower Riemann sum (the sum that uses the smaller height for each consecutive pair of partition points).
In the Riemann sum, the distance between partition points is denoted (b − a)/n. In integral notation, the distance is simply denoted dx. The fact that we use x here is simply because x is the independent variable in the function f(x). If instead we wanted to integrate a function g(y), we would use dy. The distance between partitions, dx, is the width, and the value of the function itself, f(x), is the height. Therefore, the area of one partition is f(x)dx. To express that infinitely many partitions are added together, we use an integral sign rather than a summation sign. An integral sign looks like a long, skinny “S,” where the lower bound of x can be listed on its bottom right and the upper bound of x can be written on its upper right. Therefore, the integral, the area under the curve, can be written as
\[ A = \lim_{n\to\infty} \sum_{i=1}^{n} \frac{b-a}{n}\, f\left(a + \frac{(b-a)i}{n}\right) = \int_a^b f(x)\,dx. \]
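The idea that the Riemann sum approaches the integral as the number of partitions grows can be seen numerically. The sketch below is an illustration under stated assumptions: the integrand x squared on [0, 3] is a hypothetical choice whose exact area is 9, and `right_riemann_sum` is a name introduced here.

```python
def right_riemann_sum(f, a, b, n):
    """Right Riemann sum of f on [a, b] with n equal-width partitions."""
    width = (b - a) / n
    return sum(width * f(a + width * i) for i in range(1, n + 1))


def square(x):
    # Hypothetical integrand; the area under x^2 from 0 to 3 is exactly 9.
    return x ** 2


for n in (10, 100, 1000, 10000):
    approx = right_riemann_sum(square, 0, 3, n)
    # The gap between the sum and the exact area shrinks as n grows.
    print(n, approx, abs(approx - 9))
```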
The function f(x) inside the integral is called the integrand. The bounds to the right of the integral sign are optional. When the bounds are listed, the integral is called a definite integral, which means that you intend to find the exact area under the curve of f(x) between the bounds. When the bounds are not listed, the integral is called an indefinite integral, which means that you are more interested in finding the antiderivative of f(x). The antiderivative of f(x) itself is denoted with a capital F:
\[ F(x) = \int f(x)\,dx. \]
When evaluating a definite integral, first the antiderivative is found, and then the difference of the antiderivative between the bounds is evaluated. The following notation will also be important:
\[ F(x)\Big|_a^b = F(b) - F(a). \]
Calculus students, who are assigned hundreds of integrals to practice over the course of the semester, often get lazy and stop writing dx inside the integrals. Hopefully, however, this section has made it clear why dx is an important part of the notation and should not be omitted. The width notation becomes even more important in multivariate integration (discussed in the next chapter) because dx identifies the variable over which integration should occur. Be sure to include the dx when writing integrals. Example 1. In the example in the previous section, the area under the curve of f(x) = x − x + for x ∈ [ , ] was approximated using Riemann sums. To find the exact area, a definite integral must be set up and solved. This integral is ∫ A= x − x + dx.
6.4 Solving Integrals
As stated above, there are two types of integrals: (1) indefinite integrals, with no listed bounds, and (2) definite integrals, where the bounds are listed. There are slightly different ways to solve each type of integral, but they share the same core techniques. First, I will discuss properties of integrals and the general rules to solve them. Then I will focus on techniques that are specific to each type of integral. The following properties apply to all kinds of integrals. Let c be a constant and f and g be integrable functions.
1. Constant factors can be brought outside an integral:
\[ \int c f(x)\,dx = c \int f(x)\,dx. \]
2. Integrals can be broken up across addition or subtraction:
\[ \int \left[f(x) + g(x)\right] dx = \int f(x)\,dx + \int g(x)\,dx, \]
\[ \int \left[f(x) - g(x)\right] dx = \int f(x)\,dx - \int g(x)\,dx. \]
3. The first fundamental theorem of calculus: Integration and differentiation are inverse operations.
\[ \frac{d}{dx} \int f(x)\,dx = f(x), \]
\[ \int f'(x)\,dx = f(x). \]
There is a slightly different version of this theorem for definite integrals, which is discussed with the techniques for solving definite integrals.
6.4.1 Solving Indefinite Integrals All integrals, definite and indefinite, are antiderivatives. An antiderivative of a function f is a function whose derivative is equal to f. This means that for all integrals there are some rules for integration that are the same rules you already learned for differentiation in Section 4.8, only reversed. Note that any function has infinitely many antiderivatives. For example, let f(x) = x . You should recognize that x is the derivative of x , so one antiderivative of f(x) is F(x) = x . But since constant addends disappear in differentiation, the derivatives of x + , x + , and x− are all also equal to x . To denote the class of antiderivatives, we write F(x) = x + c, where c is a general term for a constant. Always be sure to write the “+ c” at the end of solutions to indefinite integrals! The following rules for solving indefinite integrals also apply to definite integrals: 1. Let k be a nonzero constant. Constants become coefficients on x: ∫ k dx = kx + c.
2. Add 1 to exponents and divide by the new exponent (for k ≠ −1):
\[ \int x^k\,dx = \frac{x^{k+1}}{k+1} + c. \]
This rule also applies to fractions and to roots. Remember that functions of the form $1/x^n$ can be written as $x^{-n}$ and that functions of the form $\sqrt[n]{x}$ can be written as $x^{1/n}$. Even though these exponents may not be positive integers, the rule still applies.
3. As with differentiation, $e^x$ is the only function (up to a constant multiple) that is fixed to integration:
\[ \int e^x\,dx = e^x + c. \]
Figure 6.6: Dinosaur Comics by Ryan North, #359
Source: http://www.qwantz.com/index.php?comic=359
4. To integrate exponential functions (see Sections 4.8 and 1.5), we divide by the natural logarithm of the base. Let a be a positive constant other than 1:
\[ \int a^x\,dx = \frac{a^x}{\ln(a)} + c. \]
5. Recall from Section 4.8 that the derivative of a natural logarithm is 1 over the argument of the logarithm:
\[ \frac{d}{dx} \ln(x) = \frac{1}{x}. \]
Working backward, the antiderivative of the reciprocal function is the natural logarithm, with one small change. The reciprocal function $1/x$ can accept negative values of x, but the logarithm cannot. So the antiderivative is the natural logarithm of the absolute value of x:
\[ \int \frac{1}{x}\,dx = \ln(|x|) + c. \]
Example 1. Solve
∫
x + x + dx.
First, we can break up the integral across addition: ∫ ∫ ∫ x dx + x dx + dx. Then we can use the rules listed above to solve each integral individually. The first integral is solved by Rule 2: ∫ x + c.
x dx =
The second integral is also solved by Rule 2, but first we can bring the constant outside the integral: ( ) ∫ ∫ x dx = x dx = x + c = x + c. Finally, the third integral is a constant, so we apply Rule 1: ∫ dx = x + c. So the entire indefinite integral is ∫ x + x + dx =
x + x + x + c.
Note that we didn’t need to write “+ c” three times, since all the arbitrary constants are combined in the one constant term at the end.
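The five integration rules of this section can be spot-checked by differentiating each claimed antiderivative numerically and comparing it with the integrand. The sketch below is only an illustration: the specific constants (4, the exponent 3, the base 2) and the helper names `derivative` and `max_error` are assumptions made for the example.

```python
import math


def derivative(F, x, h=1e-6):
    """Central-difference approximation to F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


# Each entry pairs an integrand f with the antiderivative F claimed by a rule.
# The constants are illustrative choices, not values from the text.
rules = {
    "Rule 1: constant": (lambda x: 4.0, lambda x: 4.0 * x),
    "Rule 2: power": (lambda x: x ** 3, lambda x: x ** 4 / 4),
    "Rule 3: e^x": (math.exp, math.exp),
    "Rule 4: a^x": (lambda x: 2.0 ** x, lambda x: 2.0 ** x / math.log(2.0)),
    "Rule 5: 1/x": (lambda x: 1 / x, lambda x: math.log(abs(x))),
}


def max_error(f, F, points=(-2.0, -0.5, 0.5, 2.0)):
    """Largest gap between F'(x) and f(x) over a few test points."""
    return max(abs(derivative(F, x) - f(x)) for x in points)


for name, (f, F) in rules.items():
    print(name, max_error(f, F))
```

The negative test points matter for Rule 5: they confirm that ln(|x|), not ln(x), is the antiderivative of 1/x when x is negative.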
Example 2. Solve
∫ √
x dx.
Begin by rewriting the cube root as an exponent equal to ∫ x / dx. Then by applying Rule 2 we get ( ( / ) x +c= x /
Example 3. Solve
) /
+c=
∫ x
+
x
and bringing the constant out front:
+ ex dx.
x
/
+ c.
We rewrite the problem: ∫
x− dx +
∫
∫ x
dx +
ex dx.
We then apply Rule 2 to solve the first integral: ( − ) ∫ x x− dx = +c=− + c. − x Note that we could rewrite the second integral as ∫ x− dx, but if we tried to apply Rule 2 to this integral, we would get ∫ x + c, x− dx = which does not exist due to the division by 0. Instead, we apply Rule 5 and get ∫ dx = ln(|x|) + c. x Finally, we apply Rule 3 to the third integral: ∫ ex dx = ex + c. When we add the three antiderivatives, we can combine the three constants into one c term. The entire antiderivative is − + ln(|x|) + ex + c. x
6.4.2 Solving Definite Integrals
Indefinite integrals provide a way to find the antiderivative F(x) of a function f(x). To compute areas with definite integrals, however, we need one extra step. The second fundamental theorem of calculus: Let f be an integrable function in the domain x ∈ [a,b]. The area under the curve of f from x = a to x = b is given by
\[ \int_a^b f(x)\,dx = F(x)\Big|_a^b = F(b) - F(a). \]
In other words, after finding the antiderivative in the same way as with an indefinite integral, plug the upper bound into the antiderivative, and subtract the value of the antiderivative at the lower bound.
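As a rough check of the second fundamental theorem, the sketch below compares F(b) − F(a) with a fine midpoint Riemann sum. The integrand e^x + 1/x on [1, 2] is a hypothetical choice for illustration; its antiderivative e^x + ln(|x|) follows from Rules 3 and 5 of Section 6.4.1.

```python
import math


def f(x):
    # Hypothetical integrand chosen for this sketch.
    return math.exp(x) + 1 / x


def F(x):
    # Antiderivative of f by Rules 3 and 5 (the constant c is omitted).
    return math.exp(x) + math.log(abs(x))


def midpoint_integral(func, a, b, n=10_000):
    """Midpoint Riemann sum of func on [a, b] with n partitions."""
    width = (b - a) / n
    return width * sum(func(a + width * (i - 0.5)) for i in range(1, n + 1))


a, b = 1.0, 2.0
exact = F(b) - F(a)                # plug in the upper bound, subtract the lower
approx = midpoint_integral(f, a, b)
print(exact, approx)
```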
Example 1. In Section 6.2, we approximated the area under the curve of f(x) = x − x + from x = to x = using the left and trapezoidal Riemann sums with six partitions. Now let’s see how well these approximations performed by finding the exact value. We need to solve ∫ A= x − x + dx. Using the rules from Section 6.4.1, the antiderivative is F(x) =
x
− x + x + c.
By the second fundamental theorem of calculus, the area under the curve from 0 to 3 is ( ) − + +c− − + +c = − + +c−c= = F( ) − F( ) =
. .
So the trapezoidal Riemann sum provided a more accurate estimate than the left Riemann sum.
Notice that in the example above, the general c term canceled out. In fact, c will always cancel out in definite integrals. The general constant must be the same whether we are considering F( ) or F( ); therefore the subtraction causes the constant to drop out. For this reason, adding c to the terms for evaluation of an integral is necessary for an indefinite integral, but not for a definite integral. Definite integrals measure area, but it should be pointed out that the area measured is bounded by the x-axis. When the function is negative (i.e., when its curve is below the x-axis), the area between the curve and the x-axis counts as negative area. If half the area between the curve and the x-axis exists above the x-axis and half exists below the x-axis, then the definite integral will be equal to 0. Example 2. Find the area under the curve of h(x) = x for x ∈ [− , ]. Half of the area is above the x-axis, and half is below, so the integral should evaluate to 0. ∫ (− ) x x dx = = − = − = . −
−
There are a few more properties of integrals that apply strictly to the bounds:
1. Reversing the bounds is the same as multiplying the integral by −1:
\[ \int_a^b f(x)\,dx = -\int_b^a f(x)\,dx. \]
2. Let x ∈ [r,t], and let s be a number between r and t. Instead of integrating over the entire domain of x, we can integrate over the parts [r,s] and [s,t] and add the integrals together:
\[ \int_r^t f(x)\,dx = \int_r^s f(x)\,dx + \int_s^t f(x)\,dx. \]
3. The first fundamental theorem of calculus, for bounds: Again, this theorem expresses that derivatives and integrals cancel each other out. But if there are bounds involved, then an extra step may be necessary to evaluate an expression that has both a derivative and an integral. Derivatives are always taken with respect to a particular variable y. If this variable is the upper bound of a definite integral, then that variable is plugged into the function after cancellation:
\[ \frac{d}{dy} \int_a^y f(x)\,dx = f(y). \]
If the upper bound is a function of y, then the chain rule applies (see Section 4.9):
\[ \frac{d}{dy} \int_a^{g(y)} f(x)\,dx = f(g(y))\,g'(y). \]
If the variable, or a function of the variable, is the lower bound, then reverse the bounds and multiply the function by −1:
\[ \frac{d}{dy} \int_y^a f(x)\,dx = -\frac{d}{dy} \int_a^y f(x)\,dx = -f(y), \]
and
\[ \frac{d}{dy} \int_{g(y)}^a f(x)\,dx = -\frac{d}{dy} \int_a^{g(y)} f(x)\,dx = -f(g(y))\,g'(y). \]
If neither bound contains the variable that the derivative is taken with respect to, then the area under the curve is treated as constant, so the derivative is 0. Example 3. Simplify d dy
∫
a y
√ e
x
x
dx.
Since we are taking the derivative of an integral, there’s nothing we need to do with the ugly integrand. We are differentiating with respect to y, so if y were the upper bound, we would just plug y into the integrand and we would be done. But y is the lower bound, so we need the extra step of reversing the bounds and making the integral negative: ∫ a√ ∫ y√ √ y d x d x dx = − dx = − . dy y e x dy a e x ey
Example 4. Simplify d dy
∫
y +y − a
√ e
x
x
dx.
Here, the upper bound is a function of y, g(y) = y + y − , so we plug g(y) into the integrand and multiply by the derivative of g(y): √ y +y − ( y + y). e ( y +y − )
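The rules for differentiating an integral with a variable bound can be checked numerically: integrate up to a slightly larger and a slightly smaller bound and take the slope. Everything specific in this sketch is an assumption for illustration, including the integrand, the bound function g(y) = y² + 1, the fixed bound a = 1, and the helper names.

```python
import math


def midpoint_integral(func, a, b, n=2000):
    """Midpoint Riemann sum; if b < a the width is negative, which reverses the sign."""
    width = (b - a) / n
    return width * sum(func(a + width * (i - 0.5)) for i in range(1, n + 1))


def f(x):
    # Illustrative integrand (an assumption for this sketch).
    return math.sqrt(x) * math.exp(-x)


def g(y):
    return y ** 2 + 1       # hypothetical bound that depends on y


def g_prime(y):
    return 2 * y


A = 1.0                     # the fixed bound a


def derivative(func, y, h=1e-4):
    return (func(y + h) - func(y - h)) / (2 * h)


y = 1.5
# Upper bound g(y): the rule predicts f(g(y)) * g'(y).
upper_numeric = derivative(lambda t: midpoint_integral(f, A, g(t)), y)
upper_formula = f(g(y)) * g_prime(y)
# Lower bound y: the rule predicts -f(y).
lower_numeric = derivative(lambda t: midpoint_integral(f, t, A), y)
lower_formula = -f(y)
print(upper_numeric, upper_formula)
print(lower_numeric, lower_formula)
```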
6.5 Advanced Techniques for Solving Integrals
You should now have the basic tools necessary for solving indefinite and definite integrals. But at this point, the range of integrals that you are able to solve is rather limited. The truth is some integrals cannot be solved analytically. For example, the most important function in basic statistics is the standard normal density function:
\[ f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \]
To calculate probabilities, we need to integrate this function, but unfortunately, no analytic solution exists for this integral. We have no nice and neat formula for the antiderivative, so statisticians resort to approximations and students resort to looking up values of the definite integral for particular bounds (z-scores) in tables in the back of statistics textbooks. Although it may not be possible to integrate the normal density analytically, many other functions that are difficult to integrate can be solved with a few important tricks that are discussed here.
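To illustrate what "resorting to approximations" can look like, the sketch below integrates the standard normal density numerically between two z-scores and compares the result with Python's built-in error function `math.erf`. The bounds ±1.96 and the helper name `midpoint_integral` are assumptions chosen for the example.

```python
import math


def standard_normal_pdf(x):
    return math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi)


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n partitions."""
    width = (b - a) / n
    return width * sum(f(a + width * (i - 0.5)) for i in range(1, n + 1))


# Probability that a standard normal variable falls between -1.96 and 1.96.
prob_numeric = midpoint_integral(standard_normal_pdf, -1.96, 1.96)
# The same probability expressed with the error function.
prob_erf = math.erf(1.96 / math.sqrt(2))
print(prob_numeric, prob_erf)
```

Both values come out close to .95, the familiar figure from z-score tables.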
6.5.1 u-Substitution
Essentially, u-substitution is the chain rule for differentiation in reverse. This technique allows us to solve integrals such as
\[ \int x e^{x^2}\,dx. \]
Suppose we make the following substitution:
\[ u = x^2. \]
Then, if we take the derivative of both sides with respect to x, we have
\[ \frac{du}{dx} = 2x. \]
Multiplying both sides by dx and dividing both sides by 2, this becomes
\[ \frac{1}{2}\,du = x\,dx. \]
Therefore, substituting u for $x^2$ and (1/2)du for x dx, we can rewrite the above integral as
\[ \frac{1}{2} \int e^u\,du, \]
which we can now solve:
\[ \frac{1}{2} e^u + c. \]
Finally, substituting $x^2$ back for u, we arrive at the final answer:
\[ \frac{1}{2} e^{x^2} + c. \]
The whole trick with u-substitution is to find some part of the integrand that resembles the derivative of another part. In the above example, x only differs from the derivative of $x^2$ by a constant factor. u-substitution works because some part of the integrand is replaced by du along with dx. u should be chosen so that its derivative will include some part of the integrand to simplify the integral. To perform u-substitution, follow these steps:
1. Choose some part of the integrand to be u such that its derivative also appears in the integrand.
2. Take the derivative of u, and multiply both sides by dx.
3. If there are any constant factors that appear with dx that are not in the original integral, divide to bring them with du.
4. Substitute u and du into the integral. Any constant factors with du can go in front of the integral.
5. Solve the simplified integral, using u as the argument instead of x.
6. For indefinite integrals, substitute the expression with x back in for u in the antiderivative.
7. An optional step for definite integrals, only: derive new bounds by finding the value of u at each bound. Plug these new bounds in for u in the antiderivative and subtract. This step is easier in some situations, but it is also valid to substitute the expression with x back in for u, as in step 6, before plugging in the bounds.
Sometimes u-substitution requires quite a lot of guess-and-check work. Try a candidate for u, take the derivative, and see whether another part of the integrand is replaced, making the integral easier. If not, try another u.
Example 1. Solve
∫
ln (x) dx. x Since the derivative of ln (x) is /x (see Sections 1.5 and 4.8), we should use u-substitution. Let u = ln (x); then the derivative is du = . dx x Solving for du, this becomes du = dx. x After making the two substitutions, the integral becomes simply ∫ u du, which evaluates to
u
+ c.
Substituting ln (x) back for u, the solution is ln (x)
Example 2. Solve
∫ x
+ c.
√ x −
dx.
Let u = x −
; then the derivative, solved for du, is du = x dx.
The constant factor 3 does not appear in the integrand, so we divide to bring it with du, ( / )du = x dx. The new bounds are u( ) =
−
=
,
u( ) =
−
=
.
After making the substitutions, the integral becomes ∫ ∫ √ u du = u / du, which evaluates to ( u
/
)
( =
) /
−
/
=
. .
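The two ways of handling bounds in steps 6 and 7 can be compared numerically. The sketch below assumes the section's opening integrand reads x·e^(x²) with u = x², and it picks the bounds 0 and 1 purely for illustration; the helper names are also introduced here.

```python
import math


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n partitions."""
    width = (b - a) / n
    return width * sum(f(a + width * (i - 0.5)) for i in range(1, n + 1))


def integrand(x):
    return x * math.exp(x ** 2)


def antiderivative(x):
    # Result of substituting x^2 back in for u (step 6).
    return 0.5 * math.exp(x ** 2)


a, b = 0.0, 1.0  # illustrative bounds, not taken from the text

numeric = midpoint_integral(integrand, a, b)
# Step 6: substitute back, then plug in the original bounds.
back_substituted = antiderivative(b) - antiderivative(a)
# Step 7: convert the bounds to u(a) = a^2 and u(b) = b^2, then evaluate (1/2) e^u.
new_bounds = 0.5 * (math.exp(b ** 2) - math.exp(a ** 2))
print(numeric, back_substituted, new_bounds)
```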
6.5.2 Integration by Parts
Another technique that can help us solve integrals is integration by parts. This technique is based on the following theorem:
Theorem: If u and v are integrable functions in the domain [a,b] with derivatives du and dv, then
\[ \int_a^b u\,dv = uv\Big|_a^b - \int_a^b v\,du. \]
Proof: Integration by parts is essentially the product rule for derivatives in reverse. Recall from Section 4.8 that
\[ \frac{d}{dx}\left[f(x)g(x)\right] = f'(x)g(x) + f(x)g'(x). \]
Taking the integral of both sides, we have
\[ f(x)g(x) = \int \left[f'(x)g(x) + f(x)g'(x)\right] dx = \int f'(x)g(x)\,dx + \int f(x)g'(x)\,dx. \]
Subtracting $\int f'(x)g(x)\,dx$ from both sides,
\[ \int f(x)g'(x)\,dx = f(x)g(x) - \int f'(x)g(x)\,dx. \]
Finally, relabel the functions such that u = f(x) and v = g(x), and we have
\[ \int u\,dv = uv - \int v\,du. \]
Integration by parts is useful when an integrand can be written as the product of two functions, when one of the functions can be simplified by taking the derivative, and when the other can be simplified (or at least not made more complicated) by integrating. To use integration by parts, follow these steps:
1. Identify two functions that multiply together to form the integrand. One of these functions must be differentiated, and the other must be integrated.
2. Choose the function that will be most effectively simplified by differentiation, and label it u.
3. Label the other function dv. Let the dx inside the integral be part of dv.
4. Find du by taking the derivative of u. Include dx on the right-hand side, as we did with u-substitution in Section 6.5.1.
5. Find v by taking the indefinite integral of dv. Don’t bother writing “+ c” at the end of the antiderivative. The constant will be accounted for later.
6. Plug u, v, and du into the formula: $uv - \int v\,du$.
7. If you chose u and dv well in Steps 1, 2, and 3, then $\int v\,du$ should be easier to solve. If it is not obvious how to solve this new integral, then go back and try different choices for u and dv, or consider another integration technique.⁵
Figure 6.7: Integration by Parts (XKCD comics by Randall Munroe, #1201)
Source: xkcd.com/1201
⁵ Some integration by parts problems yield a new integral that needs to be integrated by parts again. Iterated integration by parts is cumbersome, but it can work in some situations.
Example 1. Solve
∫ x
x
dx.
The integrand is the product of two functions, x and x . x will simplify if we take its derivative, so we set u = x and dv = x dx. We can then find du and v: du = dx,
∫
x
x . v= dx = ln( ) ∫ Next we plug u, v, and du into uv − v du and solve
∫ x x − ln( ) ln( )
x
x x x − dx = ln( ) ln( ) = =
ln( ) .
−
ln( )
[
−
] ln( )
−
ln( )
.
Example 2. Find the antiderivative of ln(x) by solving
\[ \int \ln(x)\,dx. \]
It may not look like it, but the integrand really is the product of two functions, ln(x) and 1. ln(x) will simplify if we take its derivative (see Section 4.8), and 1 won’t be very complicated to integrate. So we set u = ln(x) and dv = dx. We can then find du and v:
\[ du = \frac{1}{x}\,dx \quad \text{and} \quad v = x. \]
Now we plug u, v, and du into $uv - \int v\,du$ and solve
\[ x\ln(x) - \int x \cdot \frac{1}{x}\,dx = x\ln(x) - \int dx = x\ln(x) - x + c. \]
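The antiderivative x ln(x) − x found by integration by parts can be sanity-checked against a numerical integral. The bounds 1 and e below are an illustrative choice (they make the exact answer equal to 1), and `midpoint_integral` is a helper name introduced for this sketch.

```python
import math


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n partitions."""
    width = (b - a) / n
    return width * sum(f(a + width * (i - 0.5)) for i in range(1, n + 1))


def antiderivative(x):
    # Result of integrating ln(x) by parts with u = ln(x) and dv = dx.
    return x * math.log(x) - x


numeric = midpoint_integral(math.log, 1.0, math.e)
exact = antiderivative(math.e) - antiderivative(1.0)  # (e - e) - (0 - 1) = 1
print(numeric, exact)
```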
6.5.3 Improper Integrals
Suppose you have to solve a definite integral that looks like
\[ \int_a^{\infty} f(x)\,dx. \]
At this point, you can handle many definite integrals, but what do you do about a bound of ∞? Don’t panic!⁶ Integrals with ∞ or −∞ in the bounds are called improper integrals, and solving them only requires one extra step. The best way to handle infinite quantities is by using limits. We can rewrite the improper integral with a limit:
\[ \lim_{k\to\infty} \int_a^k f(x)\,dx. \]
Then, to solve, we proceed in the same way, evaluating the limit after the integral:
\[ \lim_{k\to\infty} F(x)\Big|_a^k = \lim_{k\to\infty} \left(F(k) - F(a)\right). \]
A solution to the improper integral exists when the limit $\lim_{k\to\infty} F(k)$ exists (see Section 4.1 for the discussion of limits).
⁶ Douglas Adams would tell you that the answer is 42, anyway.
Example 1. Solve
\[ \int_0^{\infty} e^{-x}\,dx. \]
We can use u-substitution to solve this integral. Let u = −x; then du = −dx, so the (indefinite) integral solves to
\[ -\int e^u\,du = -e^u. \]
In this case, it is easier to substitute back for u than to reconfigure the bounds for u (either method is fine). So the antiderivative is $-e^{-x}$. Using a limit to deal with the infinite bound, we can solve the definite, improper integral:
\[ \lim_{k\to\infty} -e^{-x}\Big|_0^k = -\lim_{k\to\infty} e^{-k} + e^0 = 0 + 1 = 1. \]
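The limit in an improper integral can be watched directly by evaluating F(k) − F(0) for larger and larger k. This sketch assumes the example integrates e^(−x) from a lower bound of 0, with antiderivative −e^(−x); the name `area_up_to` is introduced here for illustration.

```python
import math


def antiderivative(x):
    # Antiderivative of e^(-x).
    return -math.exp(-x)


def area_up_to(k):
    """Area under e^(-x) from 0 to k, computed as F(k) - F(0)."""
    return antiderivative(k) - antiderivative(0.0)


# As k grows, the area settles toward the value of the improper integral.
for k in (1, 5, 10, 20, 50):
    print(k, area_up_to(k))
```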
Recall that x must have a particular domain. Much of the time, x exists in the set of real numbers, so that x ∈ (−∞,∞). Sometimes x exists in a domain that is smaller than all real numbers. By construction, f(x) is treated as 0 wherever x falls outside its domain, so those regions contribute nothing to any integral of f(x). Sometimes it is useful to refer to a general definite integral that looks like
\[ \int_{-\infty}^{\infty} f(x)\,dx. \]
In this case, whoever wrote this nasty-looking improper integral does not necessarily want you to use two limits. The person is just trying to generalize over any possible domain. Suppose that x ∈ [− , ]. Then for any x that is not in this range, the integral is 0; so ∫ ∞ ∫ f(x)dx = f(x)dx. −∞
−
In other words, when you see an integral with bounds at −∞ and ∞, you are being asked to plug in the domain of x into the integral. Such integrals are only improper when the domain of x involves some infinite bound (such as “all real numbers” or “all positive real numbers”).
6.6 Probability Density Functions One of the most important applications of integrals that you will encounter is the application of integration to probability functions in statistics. This section is not meant to replace a thorough discussion in a statistics text, but hopefully it will emphasize the important connections between statistics and the calculus on which it operates. A random variable takes on values that represent possible outcomes. For each outcome, there is an associated probability that the outcome will actually occur. There are two types of random variables: discrete and continuous. Discrete random variables have discrete outcomes, and continuous random variables have continuous outcomes; for example, a coin flip has the two discrete outcomes of heads or tails, but a runner’s time in a sprint is continuous. Basic statistics focuses on functions that match probabilities to outcomes. For discrete random variables, these functions are called probability mass functions, or just PMFs for short. PMFs can be described with a table in which one column lists the outcomes and the other column lists the probability associated with each outcome. For the simple examples of a coin flip and the roll of a die, the PMF is described in the left and center panels of Table 6.1. To take another example, suppose a voter is choosing between voting for Mitt Romney, Barack Obama, and Hillary Clinton. The voter has a very liberal political ideology, so one possible PMF for the random variable of this voter’s vote choice is given in the right panel of Table 6.1. Table 6.1
The PMF of a Coin Flip (Left), a Roll of a Die (Center), and an Individual’s Vote (Right)

Outcome   Probability      Outcome   Probability      Outcome   Probability
Heads     .5               1         1/6              Romney    .1
Tails     .5               2         1/6              Obama     .6
                           3         1/6              Clinton   .3
                           4         1/6
                           5         1/6
                           6         1/6
For a continuous random variable, these functions are called probability density functions, or just PDFs. It is no longer possible to describe the PDFs with a table because there are infinitely many outcomes. Instead, PDFs are best described with a graph, such as the one in Figure 6.8. For a PDF, probability is represented by the area under the curve. The probability of an outcome being between a and b (e.g., the finishing time for a sprinter being between 30 and 40 seconds) is the integral of the PDF from a to b. PDFs can operate in this way because they are designed to have a total area under the curve of exactly 1.
Figure 6.8: The Probability Density Function of the Standard Normal Distribution (x-axis: x from −3 to 3; y-axis: Standard Normal PDF from 0.0 to 0.4; annotation: Total Area = 1)
Definition: Probability density function
A function is a PDF if and only if the function is never less than 0 and the total area under the curve of the function over the domain of the random variable is equal to 1. In other words, if x ∈ [m,n], then f(x) is a PDF if and only if f(x) ≥ 0 for all x ∈ [m,n] and
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_m^n f(x)\,dx = 1. \]
This means that the probability of x falling somewhere within the range of possible outcomes is 1. More interesting questions use more restrictive bounds. One thing to notice about PDFs is that the height of the curve doesn’t matter; only area matters. It is possible for the height of a PDF to be greater than one, or for the sum of two heights on the curve to add to a value that is greater than one. But any subset of the area underneath a PDF will always be bounded between 0 and 1. Increased height goes along with increased area, but the area of a region also depends on the width of the distribution in that region. So it’s area, not height, that has a meaning for the likelihood of outcomes. Also, the probability of any one particular outcome is 0. This fact may seem a bit counterintuitive, but consider the area under the curve of one point. There is height but no width. The area of a rectangle with no width is 0, so the probability of one outcome is 0. For example, in a marathon, there is a nonzero probability that a runner finishes the race in between 4 and 5 hours. But there is zero probability that a runner finishes in precisely 4 hours and 30 minutes, and not one nanosecond before or after. It is more useful to think about the probability of a small range of possible outcomes for a continuous random variable than any one particular outcome.
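A small sketch can make the "area, not height" point concrete. The uniform density on [0, 0.5] used below is a hypothetical example, not one from the text: its height is 2 everywhere on its domain, yet its total area is still 1. The helper names are assumptions as well.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Uniform density on [low, high]: constant height inside, 0 outside."""
    return 1.0 / (high - low) if low <= x <= high else 0.0


def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f on [a, b] with n partitions."""
    width = (b - a) / n
    return width * sum(f(a + width * (i - 0.5)) for i in range(1, n + 1))


height = uniform_pdf(0.25)                                # height above 1
total_area = midpoint_integral(uniform_pdf, 0.0, 0.5)     # still integrates to 1
prob_interval = midpoint_integral(uniform_pdf, 0.1, 0.2)  # probability of a range
prob_point = midpoint_integral(uniform_pdf, 0.3, 0.3)     # zero width, zero area
print(height, total_area, prob_interval, prob_point)
```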
Example 1. The logistic PDF is a very important distribution in the social sciences and is one of the most commonly used distributions for modeling the probability of outcomes that can take on one of two values. The logistic PDF is ex f(x) = . ( + ex ) (a) Prove that this function is indeed a PDF. (b) Suppose that this function represents the distribution of attitudes on same sex marriage within the U.S. adult population, where x = is the most moderate position, ideal points greater than 0 are positions that express opposion to same sex marriage, and ideal points less than 0 are positions in favor of same sex marriage. What is the probability that a randomly chosen individual from this population has an ideal point between x = and x = ? First, we consider (a). To show that a function is a PDF, we have to demonstrate that the function is greater than or equal to 0 for all points in its domain, and that the definite integral of the function over its domain is 1. The domain of the logistic distribution is all real numbers. Note that the function ex can never be negative, even when the exponent is negative. Adding one to ex and squaring it also produces strictly positive numbers. Since the both the numerator and the denominator are always positive, the whole function must be greater than or equal to zero for all real numbers. Next, we have to demonstrate that ∫ ∞ ex dx = . x −∞ ( + e ) We can solve this integral through u-substitution. Let u = + ex . Then du = ex dx, and when we substitute u and du into the integral, the expression becomes ∫ ∫ du = u− = −u− . u Substituting back for u, the antiderivative is − ( + ex )− =
− . + ex
Now we have to consider the bounds. This is an improper integral, so we let the bounds be A and B and use limits to consider A → −∞ and B → ∞: ∞ − − − = lim − lim . + ex −∞ B→∞ + eB A→−∞ + eA Consider the first limit. As B → ∞, the denominator + eB also approaches infinity. Since the fraction has a finite numerator and an infinite denominator, the entire limit is 0. Now consider the second limit. As A → −∞, eA → since any positive real number to an infinitely negative exponent approaches 0. Therefore, the denominator approaches 1, and the second limit is − . The expression thus becomes − (− ) = , so the function is indeed a PDF. Most of the work for solving (b) is already done since we found the antiderivative while solving
(a). The question in (b) asks for the area under the curve between $x = 1$ and $x = 2$. So we have to solve the integral

$$\int_{1}^{2} \frac{e^x}{(1 + e^x)^2}\,dx = \left.\frac{-1}{1 + e^x}\right|_{1}^{2} = \frac{-1}{1 + e^2} - \frac{-1}{1 + e} = -.119 - (-.269) = .150.$$

So there is a .15 probability that a randomly chosen person has an ideological ideal point between 1 and 2.
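The two results of Example 1 can be checked numerically. The following is a minimal sketch, not part of the original example: it approximates the integrals with a midpoint Riemann sum, and it assumes that the interval from −40 to 40 is a close enough stand-in for the whole real line. The function names are illustrative.

```python
import math

def logistic_pdf(x):
    """Logistic PDF f(x) = e^x / (1 + e^x)^2."""
    ex = math.exp(x)
    return ex / (1.0 + ex) ** 2

def midpoint_sum(f, lower, upper, n=20_000):
    """Midpoint Riemann sum of f over [lower, upper] using n partitions."""
    width = (upper - lower) / n
    return sum(f(lower + (k + 0.5) * width) for k in range(n)) * width

# Assumption: [-40, 40] stands in for (-infinity, infinity).
total_area = midpoint_sum(logistic_pdf, -40.0, 40.0)
prob_1_to_2 = midpoint_sum(logistic_pdf, 1.0, 2.0)

print(round(total_area, 4), round(prob_1_to_2, 4))
```

The first number should be very close to 1, and the second should agree with the probability of about .15 found in (b).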
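The antiderivative of a PDF is called a cumulative distribution function, or CDF.

Definition: Cumulative distribution function

Let f(x) be a PDF. Then the CDF is F(x), where

$$F(x) = \int_{-\infty}^{x} f(t)\,dt.$$

Here t is only a placeholder for the variable of integration, so that it is not confused with the upper bound x.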
In other words, the height of the CDF equals the area under the PDF from the beginning of the domain of x to the specified point.
Figure 6.9 The Probability Density Function (Left) and Cumulative Distribution Function (Right) of the Standard Normal Distribution

[Figure: the left panel plots the standard normal PDF against x from −3 to 3 and is annotated "Area = .5"; the right panel plots the standard normal CDF against x from −3 to 3 and marks the point F(0) = .5.]
Unlike the height of a PDF, the height of a CDF matters. The height expresses the probability of all outcomes less than or equal to a specified value of x. In Figure 6.9, the area under the PDF from the beginning of the domain to the point $x = 0$ is .5. The value of the CDF at the point $x = 0$ is also .5, so the probability that x is less than 0 is .5. Since PDFs must have an area of 1, CDFs must always start at 0 and end at 1. CDFs can also be used to calculate the probability of regions of x that do not stretch all the way back to the beginning of the domain. Suppose we wanted to find the probability that x is between two points a and b in the domain of x. First, we can use the CDF to find $F(b) = P(x \le b)$. Then we can use the CDF to find $F(a) = P(x \le a)$. Then the probability that x is between these points is just

$$P(a \le x \le b) = F(b) - F(a).$$

In other words, we start by shading in all of the area under the PDF from the left endpoint of the domain to b, then we remove the area from the left endpoint to a. What’s left is the region between a and b.

Example 2. Derive the logistic CDF, and use this function to solve (b) of Example 1.

The logistic CDF is the integral from $-\infty$ to x of the logistic PDF. We have already found the antiderivative in Example 1. We use a limit to consider the bound of $-\infty$:

$$\int_{-\infty}^{x} \frac{e^x}{(1 + e^x)^2}\,dx = \lim_{A \to -\infty} \left.\frac{-1}{1 + e^x}\right|_{A}^{x} = \frac{-1}{1 + e^x} - \lim_{A \to -\infty} \frac{-1}{1 + e^A}.$$

Recall from Example 1 that the limit evaluates to $-1$. Therefore, the CDF F(x) is

$$F(x) = \frac{-1}{1 + e^x} + 1,$$

which we can simplify by replacing 1 with a fraction with $1 + e^x$ in the numerator and denominator, and adding the two fractions together:

$$F(x) = \frac{-1}{1 + e^x} + \frac{1 + e^x}{1 + e^x} = \frac{-1 + 1 + e^x}{1 + e^x} = \frac{e^x}{1 + e^x}.$$

Considering again (b) in Example 1, we calculate $F(2) - F(1)$:

$$F(2) - F(1) = \frac{e^2}{1 + e^2} - \frac{e}{1 + e} = .881 - .731 = .150,$$

which is the same answer we derived in Example 1.
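The rule $P(a \le x \le b) = F(b) - F(a)$ is easy to express in code. This is a small illustrative sketch using the logistic CDF derived in Example 2; the helper name `prob_between` is hypothetical.

```python
import math

def logistic_cdf(x):
    """Logistic CDF F(x) = e^x / (1 + e^x)."""
    return math.exp(x) / (1.0 + math.exp(x))

def prob_between(a, b):
    """P(a <= x <= b) = F(b) - F(a)."""
    return logistic_cdf(b) - logistic_cdf(a)

p = prob_between(1.0, 2.0)
print(round(p, 3))
```

The same two functions answer any interval question about this distribution without integrating again, which is the practical advantage of having the CDF in closed form.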
Example 3. In statistics, one kind of question that often comes up involves confidence intervals on the standard normal curve,7

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2},$$

which is the curve graphed in Figures 6.8 and 6.9. In general, no analytic solution exists for integrals of the standard normal distribution. We look up values of this integral for certain bounds in z-score tables. Confidence intervals are bounds (symmetric around 0) for which the area under the curve is a particular value. The most immediate application of confidence intervals is in statistical sampling. For large enough samples, the central limit theorem says that the true population mean will be distributed randomly around the mean of the sample. The value 0 on the x-axis refers to the sample mean, and each unit on the x-axis refers to 1 standard deviation in the sample. For the standard normal PDF described in the figure below, we know that

$$\int_{-1.96}^{1.96} f(x)\,dx = .95.$$

That is, we are 95% certain that the population mean falls within a range from 1.96 standard deviations below the sample mean to 1.96 standard deviations above the sample mean.

This interval is called a 95 percent confidence interval and is used as a baseline for statistical significance in most social science research. A researcher proposes a null hypothesis that the population mean is equal to a certain value. The researcher then checks to see if this value appears in the 95% confidence interval for the population mean, and if it does not, the researcher rejects that null hypothesis.8

PDFs and CDFs are central topics for any introductory statistics course. Understanding integrals is crucial for understanding how to use PDFs and CDFs. But integration often stands in the way of students who want to learn how to use them. Hopefully at this point, you have the knowledge of integration necessary to use PDFs and CDFs to address questions in your own field of research.

7 The standard normal curve is a normal curve with mean 0 and standard deviation 1.

8 More commonly, these kinds of statistical significance tests are performed on parameters that express the effect of one variable on the other. The null hypothesis is that the effect is 0; that is, there is no effect. If this null hypothesis is rejected, then the effect of the variable on the other variable is “statistically significant at the 95% level.”
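In place of a z-score table, software can evaluate the standard normal CDF numerically. The sketch below is one possible way to do this, assuming the standard identity that relates the normal CDF to the error function, which Python's standard library exposes as `math.erf`; the function name `std_normal_cdf` is illustrative.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Area under the standard normal PDF between -1.96 and 1.96.
area = std_normal_cdf(1.96) - std_normal_cdf(-1.96)
print(round(area, 3))
```

The result should be very close to .95, matching the area quoted in Example 3.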
6.7 Moments

Recall from Section 3.1 that there are two ways to think about probability: (1) as a measure of frequency or (2) as a measure of belief. Methods that use a frequentist point of view may be highly geared toward estimating individual probabilities, without considering the entire PDF that these probabilities come from. Bayesianists (statisticians who prefer to think about probability as a measure of belief) argue that point estimation of probability is at best a limited description of the situation and that probability distributions provide a much more complete picture.

Specifically, the shape of a probability distribution can tell us a lot about a situation. A tall and skinny PDF tells us that we are much more certain about a small number of outcomes. The peak of the graph indicates the most likely outcomes. There is useful information on where the middle of the distribution falls and whether there are more outcomes to the right or to the left of the mean of the distribution. Moments are quantities that describe the shape of a probability distribution. There is a statistical technique called the method of moments, which involves estimating these moments for a sample and using the information about shape to build an estimate for the distribution of the population.

The first moment of a probability function is its mean, or expected value. A statistical theorem called the law of large numbers tells us that repeated, independent draws from a probability distribution will have an average that converges to the expected value of the distribution, given enough draws. For a PMF f(x), the expected value E(x) is given by

$$E(x) = \sum_{x} x\,f(x).$$
PMFs have discrete outcomes. The expected value of a PMF is the sum over possible outcomes of the value of the outcome times the probability of that outcome.

Example 1. What is the expected value of one roll of a die?

The sample space of a roll of a die is $\{1, 2, 3, 4, 5, 6\}$. Each of these outcomes has an equal probability, and the PMF, f(x), is equal to $1/6$ for all possible outcomes. The expected value of a roll of a die is

$$E(x) = \sum_{i=1}^{6} x_i f(x_i)$$
$$= \big(1 \times f(1)\big) + \big(2 \times f(2)\big) + \big(3 \times f(3)\big) + \big(4 \times f(4)\big) + \big(5 \times f(5)\big) + \big(6 \times f(6)\big)$$
$$= \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = \frac{21}{6} = 3.5.$$
By the law of large numbers, if we roll a die again and again, the average of the rolls should converge to a value of 3.5.
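The die example can be reproduced in a few lines. The sketch below computes the expected value exactly from the PMF and then simulates many rolls to illustrate the law of large numbers; the number of rolls and the random seed are arbitrary choices made only for this illustration.

```python
import random
from fractions import Fraction

outcomes = [1, 2, 3, 4, 5, 6]
pmf = {x: Fraction(1, 6) for x in outcomes}

# Exact expected value: sum of outcome times probability.
expected = sum(x * p for x, p in pmf.items())

# Simulated average of many independent rolls.
random.seed(0)  # fixed seed so the run is repeatable
rolls = [random.choice(outcomes) for _ in range(100_000)]
average = sum(rolls) / len(rolls)

print(expected, round(average, 2))
```

The exact value is 7/2, and the simulated average should land close to 3.5, moving closer as the number of rolls grows.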
For a continuous random variable, the expected value of the probability density function is

$$E(x) = \int_{-\infty}^{\infty} x\,f(x)\,dx.$$
This formula for expected value is analogous to the one for PMFs, but an integral is used instead of a sum because the random variable is continuous.

Example 2. What is the expected value of a randomly drawn number from a uniform distribution between two points a and b?

The uniform distribution places equal probability on all possible continuous outcomes between a and b. The PDF is

$$f(x) = \frac{1}{b - a} \quad \text{for all } x \in [a,b].$$

If we keep the bounds a and b general for now, the expected value is

$$E(x) = \int_{a}^{b} x\,\frac{1}{b - a}\,dx = \frac{1}{b - a}\int_{a}^{b} x\,dx$$
$$= \frac{1}{b - a}\left(\left.\frac{x^2}{2}\right|_{a}^{b}\right) = \frac{1}{b - a}\left(\frac{b^2}{2} - \frac{a^2}{2}\right)$$
$$= \frac{b^2 - a^2}{2(b - a)} = \frac{(b + a)(b - a)}{2(b - a)} = \frac{b + a}{2}.$$
So the expected value of a uniform distribution is just the average of the bounds. For example, by the law of large numbers, if we take many independent draws from the uniform distribution between 0 and 10, the average of these draws will converge to 5.
Often we are interested in the expected value not only of a single random variable but of a function of many random variables as well. Expected values obey many of the same rules that summations and derivatives follow. In the following presentation, let X and Y be random variables, and let c be a nonrandom constant:

1. In probability, we consider any quantity that is not a random variable to be a constant. The expected value of a constant is equal to the constant itself: $E(c) = c$.

2. A constant factor can be brought outside of an expected value: $E(cX) = cE(X)$.

3. Expected values can be broken up over addition or subtraction: $E(X + Y) = E(X) + E(Y)$, $E(X - Y) = E(X) - E(Y)$. The same principle can be extended to summations:

$$E\left(\sum_{i=1}^{N} X_i\right) = \sum_{i=1}^{N} E(X_i).$$
4. The expected value of a product is not in general equal to the product of the expected value of each factor, unless these factors are independent:9

$$E(XY) = E(X)E(Y) \quad \text{if } X \text{ and } Y \text{ are independent.}$$

The same principle can be extended to long products:

$$E\left(\prod_{i=1}^{N} X_i\right) = \prod_{i=1}^{N} E(X_i) \quad \text{if } X_1, X_2, \ldots, X_N \text{ are all independent.}$$
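The rules above can be checked exactly on a small case. The following sketch is an illustration that is not taken from the text: it assumes X and Y are two independent fair dice, so that every expected value can be computed by summing over the joint PMF. It also shows the product rule failing when the two factors are the same variable and therefore not independent.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
p = Fraction(1, 6)  # probability of each face of one fair die

def expect(g):
    """E[g(X, Y)] for two independent fair dice, summed over the joint PMF."""
    return sum(g(x, y) * p * p for x, y in product(faces, faces))

e_x = expect(lambda x, y: x)
e_y = expect(lambda x, y: y)
e_sum = expect(lambda x, y: x + y)    # Rule 3: equals E(X) + E(Y)
e_prod = expect(lambda x, y: x * y)   # Rule 4: equals E(X)E(Y) under independence

# E(X * X): the two factors are the same variable, so they are not independent.
e_square = sum(x * x * p for x in faces)

print(e_sum, e_prod, e_square, e_x ** 2)
```

Here E(X + Y) equals E(X) + E(Y) and E(XY) equals E(X)E(Y), while E(X · X) differs from E(X)E(X), which is why Rule 4 needs its independence condition.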
Example 3. Let X, Y, and Z be random variables. Suppose that E(X) = , E(Y) = − , and E(Z) = . Also, suppose that Y and Z are independent. Find the following expected values: (a) E( X + Y − Z + ) First, we can break up the expected value over addition and subtraction: E( X) + E( Y) − E(Z) + E( ). The expected value of the constant 4 is itself equal to 4: E( X) + E( Y) − E(Z) + . Next, the constant factors can be brought outside of the expected values: E(X) + E(Y) − E(Z) + . Finally, we plug the individual expected values into the expression and solve: ( ) + (− ) − ( ) +
=
− − +
=− .
(b) E(− + X + YZ) First, we break the expected value up over addition and subtraction: E(− ) + E(X) + E( YZ). The expected value of − is − , and we can bring the constant factor of 2 outside of the expected value: − + E(X) + E(YZ). Since Y and Z are independent, in this instance we can break the expected value up over multiplication: − + E(X) + E(Y)E(Z). Finally, we plug in the individual expected values and solve the expression: − + + (− )( ) = − .
9 Technically, all that is required for this property to be true is that the random variable factors X and Y are uncorrelated (see Section 7.4.4 for a formal discussion of correlation). It is possible for two variables to be nonindependent but still uncorrelated. However, since independence implies no correlation, this property holds for independent random variables.
(c) E( + XY − Z) We can start by breaking the expected value up over addition and subtraction: E( ) + E(XY) − E( Z). But here we have to stop. We do not know whether X and Y are independent, so we cannot assume that the expected value can be extended to each factor. To calculate this expected value, we would need to know the joint PDF for X and Y, a topic covered in Section 7.4.4. Therefore, we do not have enough information to solve this problem.
Example 4. In Section 2.5.3, we were introduced to the multiple regression model. Please refer to that section if you are unfamiliar with the following notation. A multiple regression of a dependent variable $y_i$ on five independent variables $x_{i1}, x_{i2}, x_{i3}, x_{i4}, x_{i5}$ is given by

$$y_i = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i.$$

In a regression model, the error term $\epsilon_i$ is considered to be a normally distributed random variable with an expected value of 0. All other terms in the regression other than $y_i$ are considered to be constant. What is the expected value of $y_i$ in the multiple regression model listed above?

We begin by taking the expected value of both sides:

$$E(y_i) = E(\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \epsilon_i).$$

The expected value on the right-hand side of the equation breaks up over addition:

$$E(\alpha) + E(\beta_1 x_{i1}) + E(\beta_2 x_{i2}) + E(\beta_3 x_{i3}) + E(\beta_4 x_{i4}) + E(\beta_5 x_{i5}) + E(\epsilon_i).$$

Notice that every expected value except for the last one is the expectation of a constant. All of these expected values are equal to the terms inside these expected values:

$$\alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + E(\epsilon_i).$$

Finally, we know that $E(\epsilon_i) = 0$, so the whole expected value is equal to

$$E(y_i) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}.$$

In other words, $y_i$ is expected to be on the best-fit line implied by the regression.10
If x is a random variable, then the quantity x − E(x) is the variable centered on a mean of 0. This transformation shifts a distribution to the left or to the right so that its mean occurs precisely at 0. Once a random variable has been centered, we can calculate higher moments of its PDF. There are infinitely many moments for an integrable PDF, and there is a general formula for the nth moment:
10 Technically, for a multiple regression, we aren’t drawing a best-fit line. We are drawing a best-fit plane in multiple dimensions. But the idea holds that we expect values of $y_i$ to be on this plane on average.
Definition: nth Moment

The nth moment of a PDF of a variable is given by the quantity

$$\int_{-\infty}^{\infty} (x - c)^n f(x)\,dx,$$

where $c = E(x)$ is the mean of the variable’s distribution. For distributions that are transformed to have a mean of 0, the nth moment is

$$\int_{-\infty}^{\infty} x^n f(x)\,dx.$$

The first moment, the expected value, is a special case of this moment definition and occurs when $n = 1$ and $c = 0$ (otherwise we’d be subtracting the expected value away from itself). The second moment for a distribution centered on 0 is called the variance V(x) of the distribution. Using the definition of the second moment, we have

$$V(x) = \int_{-\infty}^{\infty} x^2 f(x)\,dx.$$

Plugging $x^2$ in for x in the definition of expected value above, we have a theorem for the variance of a mean-centered distribution:

Theorem: If $E(x) = 0$, then $V(x) = E(x^2)$.

If the distribution is not centered on 0 but instead has an expected value equal to c, the variance is

$$V(x) = \int_{-\infty}^{\infty} (x - c)^2 f(x)\,dx.$$

This definition implies that variances cannot be negative,

$$V(x) \ge 0 \quad \forall x,$$

since variances are calculated from squared deviations from the expected value. There is a theorem that describes the relationship between the mean and variance of any distribution, regardless of whether or not the distribution has been centered on its mean:

Theorem: $V(x) = E(x^2) - E(x)^2$.

Proof: The variance of any distribution is

$$V(x) = \int_{-\infty}^{\infty} (x - c)^2 f(x)\,dx,$$

where c is the expected value of the distribution. Multiplying out the squared term gives us

$$V(x) = \int_{-\infty}^{\infty} (x^2 - 2cx + c^2) f(x)\,dx.$$
We can break up the integral across these terms and bring the constants outside the integral:

$$V(x) = \int_{-\infty}^{\infty} x^2 f(x)\,dx - 2c\int_{-\infty}^{\infty} x f(x)\,dx + c^2\int_{-\infty}^{\infty} f(x)\,dx.$$

Note that the second integral is the expected value of the distribution, which we are denoting here as c, and that the third integral evaluates to 1 since it is a PDF integrated over its domain. Therefore, the variance can be rewritten as

$$V(x) = \left(\int_{-\infty}^{\infty} x^2 f(x)\,dx\right) - 2c(c) + c^2 = \left(\int_{-\infty}^{\infty} x^2 f(x)\,dx\right) - c^2.$$

Finally, note that $\int_{-\infty}^{\infty} x^2 f(x)\,dx = E(x^2)$. We also know that $c = E(x)$; therefore,

$$V(x) = E(x^2) - E(x)^2.$$
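The theorem can be illustrated on a distribution small enough to compute by hand. The sketch below uses the PMF of a fair die as a discrete stand-in (an assumption made for this illustration, since the proof above is written for PDFs, with sums taking the place of integrals) and computes the variance both from the definition and from the shortcut.

```python
from fractions import Fraction

# PMF of a fair die, used here as a discrete analogue of a PDF.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

mean = sum(x * p for x, p in pmf.items())

# Definition: expected squared deviation from the mean.
var_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Theorem: E(x^2) - E(x)^2.
var_shortcut = sum(x * x * p for x, p in pmf.items()) - mean ** 2

print(var_definition, var_shortcut)
```

Both routes give 35/12, as the theorem requires.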
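Example 5. Derive the variance of the uniform distribution between a and b.

We now know the expected value of the uniform distribution from Example 2, so using the formula for variance, the calculation breaks down to

$$V(x) = E(x^2) - E(x)^2$$
$$= \int_{a}^{b} \frac{x^2}{b - a}\,dx - \left(\frac{b + a}{2}\right)^2$$
$$= \frac{1}{b - a}\int_{a}^{b} x^2\,dx - \left(\frac{b + a}{2}\right)^2$$
$$= \frac{1}{b - a}\left(\left.\frac{x^3}{3}\right|_{a}^{b}\right) - \left(\frac{b + a}{2}\right)^2$$
$$= \frac{b^3 - a^3}{3(b - a)} - \left(\frac{b + a}{2}\right)^2$$
$$= \frac{(b - a)(b^2 + ab + a^2)}{3(b - a)} - \left(\frac{b + a}{2}\right)^2$$
$$= \frac{b^2 + ab + a^2}{3} - \frac{b^2 + 2ab + a^2}{4}$$
$$= \frac{4b^2 + 4ab + 4a^2}{12} - \frac{3b^2 + 6ab + 3a^2}{12}$$
$$= \frac{b^2 - 2ab + a^2}{12}.$$

So the variance of the uniform distribution between a and b is

$$V(x) = \frac{(b - a)^2}{12}.$$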
For the uniform distribution between 0 and 10, the expected value is 5, the variance is

$$\frac{(10 - 0)^2}{12} = \frac{100}{12} = 8.33,$$

and the standard deviation is

$$\sqrt{8.33} = 2.89.$$
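The mean, variance, and standard deviation of the uniform distribution between 0 and 10 can also be approximated directly from the integral definitions. This is a rough numerical sketch using a midpoint Riemann sum; the helper name and the number of partitions are illustrative choices.

```python
import math

def midpoint_integral(g, lower, upper, n=10_000):
    """Midpoint Riemann sum of g over [lower, upper] using n partitions."""
    width = (upper - lower) / n
    return sum(g(lower + (k + 0.5) * width) for k in range(n)) * width

a, b = 0.0, 10.0

def pdf(x):
    """Uniform PDF on [a, b]."""
    return 1.0 / (b - a)

mean = midpoint_integral(lambda x: x * pdf(x), a, b)             # E(x)
second_moment = midpoint_integral(lambda x: x * x * pdf(x), a, b)  # E(x^2)
variance = second_moment - mean ** 2                              # E(x^2) - E(x)^2
std_dev = math.sqrt(variance)

print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```

The three values should agree with 5, 8.33, and 2.89 to two decimal places.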
Like expected values, there are rules for deriving the variances of functions of random variables. In the following presentation, let X and Y be random variables, and let a, b, and c be nonrandom constants:

1. The variance of a constant is 0 because, by definition, constants do not vary: $V(c) = 0$. By the same logic, adding a constant to a random variable does not change the variance of that random variable: $V(X + c) = V(X)$.

2. Constant factors cannot simply be brought outside of a variance. They have to be squared first: $V(cX) = c^2 V(X)$.

3. Section 7.4.4 defines a quantity called a covariance. A covariance (Cov) measures how much two random variables tend to vary with one another. For a detailed discussion of covariance, refer to Section 7.4.4. The variance of the sum and difference of two random variables depends on the variance of each variable and the covariance between the two variables. The variance of a sum is

$$V(X + Y) = V(X) + V(Y) + 2\,\mathrm{Cov}(X,Y),$$

and the variance of a difference is

$$V(X - Y) = V(X) + V(Y) - 2\,\mathrm{Cov}(X,Y).$$

4. If two random variables are independent, then their covariance is equal to 0. In this case, and only in this case, does a variance break up over addition:

$$V(X + Y) = V(X) + V(Y) \quad \text{if } X \text{ and } Y \text{ are independent.}$$

However, even if the two variables are independent, the variance does not break up over subtraction. Instead,

$$V(X - Y) = V(X) + V(Y) \quad \text{if } X \text{ and } Y \text{ are independent.}$$

5. The variance of a weighted sum of two random variables also depends on the variance of each variable and the covariance between them:

$$V(aX + bY) = a^2 V(X) + b^2 V(Y) + 2ab\,\mathrm{Cov}(X,Y).$$

The variance of a weighted difference is given by the same formula in the special case where b is negative.
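Rule 5 can be checked on simulated data. In the sketch below, the two variables, the weights, and the seed are all hypothetical choices made for illustration: Y is built from X so that the two are correlated, and the variance and covariance are computed as plain averages over the sample. With those sample quantities, the two sides of Rule 5 match up to floating-point rounding.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(u, v):
    """Average product of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

def var(values):
    """Variance as the covariance of a variable with itself."""
    return cov(values, values)

random.seed(1)  # fixed seed so the run is repeatable
x = [random.gauss(0.0, 2.0) for _ in range(10_000)]
y = [0.5 * xi + random.gauss(0.0, 1.0) for xi in x]  # correlated with x by construction

a, b = 3.0, -2.0  # a negative b gives a weighted difference
lhs = var([a * xi + b * yi for xi, yi in zip(x, y)])
rhs = a ** 2 * var(x) + b ** 2 * var(y) + 2 * a * b * cov(x, y)

print(round(lhs, 6), round(rhs, 6))
```

Dropping the covariance term from `rhs` would leave a visible gap between the two numbers, which is the practical content of Rules 3 and 5.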
In many cases, statisticians use the shorthand of $\mu$ for the mean of a distribution and $\sigma^2$ for the variance of the distribution. $\sigma$, the square root of the variance, is the standard deviation of the distribution.

Example 6. Consider two random variables X and Y where V(X) = , V(Y) = , and the covariance between the two variables is Cov(X,Y) = − . Find the variance of the following expressions: a. X − By Rule 1, adding a constant to a random variable does not change the variance of the random variable. So the variance of this expression is V(X − ) = V(X) = . b. X − Y Since the covariance is not equal to 0, these variables are not independent, so we use the formula from Rule 3: V(X − Y) = V(X) + V(Y) − 2Cov(X,Y) = + − (− ) = . c. − X − Y + First, we apply Rule 1 to remove the constant addend of 10: V( X − Y +
) = V( X − Y).
The expression is now a weighted sum, so we use the formula in Rule 5: V( X − Y) =
V(X) + (− ) V(Y) + ( )(− )Cov(X,Y)
= V(X) +
V(Y) − ( )−
= ( )+
Cov(X,Y)
(− ) =
.
Example 7. Consider N random variables, $x_1, x_2, \ldots, x_N$. Suppose that each variable has the same variance, $\sigma^2$, and that the variables are all independent. Find a formula for the variance and standard deviation of the mean of all of these variables.

The variance of the mean of all N variables can be written as

$$V\left(\frac{\sum_{i=1}^{N} x_i}{N}\right).$$

First, we can square the constant factor $\frac{1}{N}$ and bring it outside the variance:

$$\frac{V\left(\sum_{i=1}^{N} x_i\right)}{N^2}.$$

Since the variables are independent, we can break the variance up over addition:

$$\frac{\sum_{i=1}^{N} V(x_i)}{N^2}.$$
Each variable has the same variance. Plugging in these variances, we get

$$\frac{\sum_{i=1}^{N} \sigma^2}{N^2}.$$

Note now that the term inside the summation does not depend on the summation index i. Recall from Section 1.6 that the summation from 1 to N of a term that does not depend on the index is the same as multiplying the term by N. So the variance of the expression becomes

$$V\left(\frac{\sum_{i=1}^{N} x_i}{N}\right) = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}.$$

The standard deviation (SD) of the mean is the square root of the variance:

$$SD\left(\frac{\sum_{i=1}^{N} x_i}{N}\right) = \sqrt{\frac{\sigma^2}{N}} = \frac{\sigma}{\sqrt{N}}.$$

This problem is one of the most common problems in the social sciences. We can think of $x_1, x_2, \ldots, x_N$ as a random sample from a population. We often assume that the observations in our random sample are both independent and identically distributed (IID) with the same variance. In that case, the sample mean, $\sum_{i=1}^{N} x_i / N$, is an estimate of the mean of the population. The standard deviation of the sample mean is called the standard error. As will be discussed in Section 9.3, the standard error is used to calculate confidence intervals and to test hypotheses.
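The formula $\sigma/\sqrt{N}$ can be illustrated by simulation. The sketch below draws many independent samples, records each sample mean, and compares the standard deviation of those means with the theoretical standard error. The population (normal with mean 0), the values of $\sigma$ and N, the number of repetitions, and the seed are all assumptions chosen only for this illustration.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable
sigma, n, trials = 3.0, 25, 4000

# Mean of each of many independent samples of size n.
sample_means = [
    statistics.fmean([random.gauss(0.0, sigma) for _ in range(n)])
    for _ in range(trials)
]

observed_se = statistics.pstdev(sample_means)   # spread of the sample means
theoretical_se = sigma / math.sqrt(n)           # sigma / sqrt(N)

print(round(observed_se, 2), round(theoretical_se, 2))
```

The two values should be close; quadrupling `n` should roughly halve both.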
For normally distributed random variables with expected value $\mu$ and variance $\sigma^2$, the variable can be normalized by subtracting the mean and dividing by the standard deviation. This new variable is called a z-score. In other words, if $x \sim N(\mu, \sigma^2)$, then $z \sim N(0, 1)$, where

$$z = \frac{x - \mu}{\sigma}.$$

The third moment of a normalized random variable,

$$\int_{-\infty}^{\infty} z^3 f(z)\,dz,$$
is called the skewness of a PDF. The skewness represents how lopsided a probability function is. If a PDF is perfectly symmetric around its mean, like a normal distribution, then the PDF has a skewness of 0. If there is more probability mass to the right, then the skewness will be positive. If there is more probability mass to the left, then the skewness will be negative. The fourth moment of a random variable, centered on its mean, is a measure of how fat or skinny (or how sharply “peaked”) the PDF is around its mean. A linear transformation of the fourth moment, which compares the width of a distribution with the width of a normal distribution, is called the kurtosis of the PDF.
Exercises
1. Consider the standard normal PDF (the normal distribution with a mean of 0 and a variance of 1):

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-.5x^2},$$

which has the graph displayed in Figure 6.10.

Figure 6.10 The Standard Normal Distribution

[Figure: f(x) plotted against x, with the points x = −1.96 and x = 1.96 marked on the x-axis.]

The normal distribution, like other probability distributions, is useful because the area under the curve equates to the probability that the random variable x could be in that region. For the standard normal distribution, we know that 95% of the area falls between the values of $x = -1.96$ and $x = 1.96$. One strange fact is that although we use the normal distribution and calculate the area under this curve all the time, this function has no analytic integral. The word analytic means “able to express as a formula,” so there is no way to express the integral of the standard normal distribution with a formula. We have to resort to approximating values of that integral or to looking up values of this distribution in the back of old stats textbooks (z-score tables). In this problem, you will calculate the approximate area that exists between $x = -1.96$ and $x = 1.96$. You can approximate the integral

$$\int_{-1.96}^{1.96} \frac{1}{\sqrt{2\pi}}\,e^{-.5x^2}\,dx$$

by using a Riemann sum. Copy Figure 6.10 four times.

(a) In the four figures you just copied, draw the rectangles for the left, right, midpoint, and trapezoidal Riemann sums from $x = -1.96$ to $x = 1.96$ with 10 partitions.
(b) Write down the formulas for the left, right, midpoint, and trapezoidal Riemann sums you drew in (a). (c) Evaluate the value of the Riemann sums you specified in (b). Which of the four Riemann sums provides the best approximation for the 95% confidence interval of the standard normal distribution? 2. Solve the following integrals: ∫ (a) x + ex − ( x ) dx ∫ √ (b) x + dx x ∫ ∞ (c) dx x ∫ y √ d (d) x dx dy − ∫ d (e) ex dx dz √z+ln(z) 3. Evaluate the following integrals using u-substitution: ∫ (a) ( x − x + )( x − x + x) dx ∫ (b) xex − dx ∫ x+ (c) dx x + x− ∫ √ (d) x x + dx 4. Use integration by parts to solve the following integrals: ∫ √ (a) x x + dx ∫ (b) xex dx ∫ e (c) x ln(x) dx ∫ (d) x ex dx (Hint: You’ll have to do integration by parts twice to evaluate this integral.) 5. Evaluate the following integrals using whatever method seems to work best: ∫ √ x + ex dx (a) ln( )
∫
e ln(x) (b) dx x e ∫ ln(x) dx (c) x ∫ √ (d) ex + x dx
∫ y −
(e) ∫
x
(f)
√
y+
dy
x + dx
∫ x ln(x) dx
(g) ∫ (h)
∞
x + x+ dx (x + x + x + )
6. Consider the example of your students’ scores on a test. Suppose these scores are coded as percentages, where 0.5 means 50% and 1 means 100%. If I ask you to predict the score of a student before I pick that student at random, you cannot and should not be able to give me a prediction with certainty. You can, however, give me an educated guess about that student’s likely score. Suppose that the test scores are distributed according to the following function:

$$f(x) = \frac{3\sqrt{x}}{2}.$$

This function is graphed in Figure 6.11. The x-axis of this graph contains the possible test scores, from 0% to 100%.

(a) Demonstrate that this function is a PDF over the domain $x \in [0, 1]$.

(b) Suppose that you are using the following grading scale:

Score (%)    Letter Grade
>90          A
80–90        B
70–80        C
60–70        D
$\lim_{x \to a} f(x) = L$ if and only if for all $\epsilon > 0$, there exists a $\delta > 0$ such that if $0 < |x - a| < \delta$, then $|f(x) - L| < \epsilon$.
The definition means that we can make the distance between L and the value of the function at x be as close as we want by allowing x to be sufficiently close to a. This definition implies that a limit only exists when the left and right limits are equal to each other:

$$\lim_{x \to a^+} f(x) = \lim_{x \to a^-} f(x).$$
For the three-dimensional case, the definition is as follows:

Definition: Limit in three dimensions

$$\lim_{(x,y) \to (a,b)} f(x,y) = L$$

if and only if for all $\epsilon > 0$, there exists a $\delta > 0$ such that if $0 < \sqrt{(x - a)^2 + (y - b)^2} < \delta$, then $|f(x,y) - L| < \epsilon$.

The distance between any two points, say (x,y) and (a,b), is given by the following formula:

$$D = \sqrt{(x - a)^2 + (y - b)^2}.$$

The definition of a three-dimensional limit says that the value of the function can be made as close to a value L as we want by choosing a point (x,y) that is sufficiently close to (a,b). Like the two-dimensional case, a three-dimensional limit only exists when it approaches the same value from different directions. Unlike the two-dimensional case, however, there are many more directions to consider. The limit

$$\lim_{(x,y) \to (a,b)} f(x,y)$$
exists only when all the left and right limits are equal to each other on every line that goes through the point (a,b) and when these limits are equal to the limits on every other line through that point. Proving that a limit exists is difficult because there are infinitely many lines that go through one point. In addition, the limits must also exist and be equal over all parabolas, cubic functions, and infinitely many functions of other forms. To prove that a limit exists, the brute-force approach employed in the two-dimensional case will not work. Mathematicians use certain tricks for these proofs, but these proofs are beyond the scope of the material presented here. On the other hand, it is easy to demonstrate that a limit doesn’t exist by showing that two linear (or quadratic or cubic, etc.) functions have different limits at a point or that the left and right limits are unequal for a line through the point. Imagine a landscape with a sharp cliff running through the middle. If a three-dimensional graph has such a cliff, then the limit will not exist for points on that cliff.
One way to show that a limit doesn’t exist at a point is to consider a general line through the point. If we plug the point (a,b) into the formula for the slope of a line, then we can derive the general linear equation:

$$m = \frac{y - b}{x - a},$$
$$y - b = m(x - a),$$
$$y = mx - (ma - b).$$

The slope of this general linear equation is m, and the y-intercept is $-(ma - b)$. For a point (a,b), simply plug in a and b; then any line that goes through the point (a,b) can be expressed by choosing the appropriate slope m. For example, to evaluate the existence of a limit where $(x,y) \to (3, 2)$, we have to consider all lines of the form

$$m = \frac{y - 2}{x - 3},$$
$$y - 2 = m(x - 3),$$
$$y = mx + (-3m + 2).$$

This group of linear functions includes $y = x - 1$ when $m = 1$ and $y = 2x - 4$ when $m = 2$. For any of these lines, plugging the linear function in for y, the left and right limits must exist for the limit to exist. Figure 7.2 shows several lines that go through the point (3, 2). If the limit of a function does not exist at (3, 2), we can prove it by plugging the general formula for the linear function in for y, evaluating the two-dimensional limit, and seeing if the answer depends on the slope m. If so, then the limit takes on different values for different lines around the point, and the limit does not exist.

Figure 7.2 Closing In on the Point (3, 2) Along Various Lines Through (3, 2) in the xy-Plane

[Figure: several lines through the point (3, 2), plotted with x from 2.0 to 4.0 and y from 1.0 to 3.0.]
Example 1. Prove that the limit

$$\lim_{(x,y) \to (0,0)} \frac{x^2 + y^2}{xy}$$

does not exist.

To prove that the three-dimensional limit does not exist, we have to show that the limit takes on different values for different lines through the point (0, 0). Any line through (0, 0) is of the form $y = mx$, for some value of m. We can substitute for y, leaving m general:

$$\lim_{x \to 0} \frac{x^2 + (mx)^2}{x(mx)}.$$

The limit simplifies to

$$\lim_{x \to 0} \frac{(m^2 + 1)x^2}{mx^2} = \lim_{x \to 0} \frac{m^2 + 1}{m} = \frac{m^2 + 1}{m}.$$

The values of the left and right limits along any line are equal, but this value is different for different lines. Along the line $y = x$, the limit is 2, but along the line $y = 2x$, the limit is 5/2. Therefore, the limit does not exist. In fact, for slopes greater than 1, the height of the function increases with the slope, like a spiral staircase.
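The dependence on the slope can be seen numerically by evaluating the function at points very close to (0, 0) on different lines. This is an illustrative sketch; the helper name and the choice of how close to approach are arbitrary.

```python
def f(x, y):
    """The function from Example 1: (x^2 + y^2) / (xy)."""
    return (x ** 2 + y ** 2) / (x * y)

def value_along_line(m, x=1e-6):
    """Evaluate f at a point near (0, 0) on the line y = m * x."""
    return f(x, m * x)

near_limit_m1 = value_along_line(1.0)   # approaching along y = x
near_limit_m2 = value_along_line(2.0)   # approaching along y = 2x

print(near_limit_m1, near_limit_m2)
```

The two values differ (about 2 along $y = x$ and about 2.5 along $y = 2x$) no matter how small x is made, which is the numerical signature of a limit that does not exist.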
The strategy for solving finite multivariate limits that do exist is nearly the same as the strategy to solve two-dimensional limits:

1. Try plugging in the values that x and y are approaching.

2. If the function is discontinuous at that point, plugging it in won’t work, so try to algebraically cancel out a problematic factor.

3. If all else fails, you can try plugging in numbers that are closer and closer to the limit that (x,y) approaches. Be careful, though: If the value of the limit depends on the rate at which the x values approach their limit relative to the rate at which the y values approach theirs, then the limit probably does not exist.
Example 2. Solve the limit

$$\lim_{(x,y) \to (0,0)} \frac{x^2 - y^2}{x + y}.$$

Plugging in the point (0, 0) won’t work because that leaves 0 in the denominator. But there is an algebraic cancelation that we can use. The numerator factors to $(x - y)(x + y)$, so the factor $(x + y)$ cancels from the top and bottom, leaving us with

$$\lim_{(x,y) \to (0,0)} (x - y),$$

which after plugging in the point (0, 0) evaluates to 0.
Continuity works the same way for multivariate functions as it does for two-dimensional functions.
Definition: Continuity of a three-dimensional function at a point

A function is continuous at point (a,b) if and only if

$$\lim_{(x,y) \to (a,b)} f(x,y) = f(a,b).$$
A three-dimensional function is continuous if it is continuous for all ordered pairs (a,b) in its domain. The definition of continuity is completely analogous for higher-dimensional functions: Simply throw more arguments into the above definition.
7.3 Partial Derivatives

Derivatives tell us the slope of a function. But multidimensional functions have several slopes at any one point. Imagine that you are standing on a two-peaked mountain, halfway between the two peaks. In front of you, the elevation rises toward one peak, and behind you the elevation rises to the other peak. To your left and to your right, the elevation drops toward valleys at the foot of the mountain. Now also imagine that you are standing at the intersection point of a pair of axes: one line going in front of you and behind you, toward the peaks, and a perpendicular line going to your left and right, toward the valleys. There is a derivative along the first axis that describes the inclining slopes in front of you and behind you, and there is a derivative along the second axis that describes the declining slopes to your left and right. Each of these derivatives is called a partial derivative.

A multidimensional function will have a different partial derivative for each of its independent variables. A function f(x,y), for example, will have a partial derivative with respect to x and a partial derivative with respect to y. The partial derivative with respect to x is the change in the value of the function for a one-unit increase in x, and the partial derivative with respect to y is the change in the value of the function for a one-unit increase in y. It’s possible that the partial derivatives depend on each other: That is, the partial derivative with respect to x depends on the value of y, and vice versa. In other words, the change in the elevation of the landscape from east to west should depend on how far north or south you are. In three dimensions, two partial derivatives can describe a complex landscape, with rolling hills; sharp peaks, cliffs, dunes, and plateaus; or any other feature you can imagine.
7.3.1 Definition and Notation

Recall from Section 4.6 that there are two versions of the definition of a derivative in two dimensions. Each version conceptualizes a derivative as the average rate of change between two points, where the two points are being pushed closer and closer together. The first version of the definition is

$$f'(a) = \lim_{x \to a} \frac{f(x) - f(a)}{x - a},$$

and the second version is

$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}.$$

The definition of a partial derivative is conceptually almost the same. A partial derivative is the average rate of change between two points, where these two points are pushed closer and closer together on one particular axis. We hold one of the variables constant, while we consider changes in the other variable. The partial derivative of a function f(x,y) with respect to x considers changes in x while leaving y alone.
Definition: Partial derivative with respect to x
The partial derivative of a function f(x,y) with respect to x is

$$\lim_{x \to a} \frac{f(x,y) - f(a,y)}{x - a},$$

which can also be written as

$$\lim_{h \to 0} \frac{f(x + h,y) - f(x,y)}{h}.$$
Likewise, the partial derivative of a function f(x,y) with respect to y considers changes in y while leaving x alone.

Definition: Partial derivative with respect to y
The partial derivative of a function f(x,y) with respect to y is

$$\lim_{y \to a} \frac{f(x,y) - f(x,a)}{y - a},$$

which can also be written as

$$\lim_{h \to 0} \frac{f(x,y + h) - f(x,y)}{h}.$$
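The limit definitions can be illustrated numerically by evaluating the difference quotient with a small but nonzero h. The following is a minimal sketch, not part of the original text: the function f and the helper names partial_x and partial_y are hypothetical, chosen only for illustration.

```python
def partial_x(f, x, y, h=1e-6):
    """Difference quotient in x: y is held constant while x moves by h."""
    return (f(x + h, y) - f(x, y)) / h


def partial_y(f, x, y, h=1e-6):
    """Difference quotient in y: x is held constant while y moves by h."""
    return (f(x, y + h) - f(x, y)) / h


def f(x, y):
    # Hypothetical function chosen for illustration: f(x, y) = x^2 * y + y^3
    return x**2 * y + y**3


# The exact partial derivatives are f_x = 2xy and f_y = x^2 + 3y^2,
# so at (1, 2) the quotients should be close to 4 and 13.
approx_fx = partial_x(f, 1.0, 2.0)
approx_fy = partial_y(f, 1.0, 2.0)
print(approx_fx, approx_fy)
```

Shrinking h moves the quotient closer to the exact partial derivative, up to the limits of floating-point precision.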
After deriving a formula for the partial derivative with respect to x, there may still be terms in the formula that contain y. If we plug in a point (x,y), then the formula will give us the slope of the function along the x-axis at the point (x,y). Likewise, we can use the formula for a partial derivative with respect to y to find the slope of the function along the y-axis at the point (x,y). Example 1. Use the definition of a partial derivative to find the partial derivative with respect to x of g(x,y) = x + y + x y − xy + x − y − . We will use the formula lim
x→a
g(x,y) − g(a,y) . x−a
Plugging in the function g, we get lim
x y − xy + x − y − ) − ( a + y + a y − ay + a − y − ) x−a ( x − a )+( y − y )+( x y− a y)−( xy− ay)+( x− a)−( y− y)− + x−a (x − a ) + y(x − a ) − y(x − a) + (x − a) x−a (x − a)(x + ax + a ) + y(x − a)(x + a) − y(x − a) + (x − a) x−a (x + ax + a ) + y(x + a) − y +
( x + y +
x→a
= lim
x→a
= lim
x→a
= lim
x→a
= lim
x→a
= (a + a + a ) + = a +
y(a + a) − y +
ay − y + .
Therefore, for any general point x, the derivative of g(x,y) with respect to x is xy − y + .
x +
As in Section 4.7, partial derivatives can also be denoted using Newton's notation or Leibniz's notation. Once again, it is important to be immediately familiar with both notations since they are used interchangeably.

First, partial derivatives are denoted in Newton's notation by subscripting the function once with the variable that the function is being differentiated with respect to. For example, the first partial derivative of f(x,y) with respect to x is denoted $f_x(x,y)$ and with respect to y is denoted $f_y(x,y)$.

Leibniz's notation involves a fraction, where the independent variable is placed in the denominator. For first partial derivatives, the notation is very nearly the same. The denominator contains the variable that the function is being differentiated with respect to. Here, however, the symbol ∂ is used instead of d, to denote that the derivative is a partial derivative. The first partial derivative of f(x,y) with respect to x, in Leibniz notation, is

$$\frac{\partial}{\partial x} f(x,y) \quad \text{or} \quad \frac{\partial f}{\partial x},$$

and with respect to y is

$$\frac{\partial}{\partial y} f(x,y) \quad \text{or} \quad \frac{\partial f}{\partial y}.$$
Second partial derivatives are a bit more difficult to conceptualize than regular second derivatives. Just as a first partial derivative can be taken with respect to any of the independent variables, second partial derivatives can be taken with respect to any of the independent variables, regardless of which variable was used for the first partial derivative. A three-dimensional function f(x,y) has two partial derivatives. Each of these derivatives has two partial derivatives. Therefore, the function has four second partial derivatives. Newton notation adds another subscript to denote a second partial derivative. Leibniz notation multiplies derivative fractions together. The notation for the second partial derivative of a function f(x,y) with respect to x and then with respect to x again is

$$f_{xx}(x,y) \quad \text{or} \quad \frac{\partial^2 f}{\partial x^2}.$$

The notation for the second partial with respect to x and then y is

$$f_{xy}(x,y) \quad \text{or} \quad \frac{\partial^2 f}{\partial x \partial y}.$$

Likewise, the notation for the second partial with respect to y and then x is

$$f_{yx}(x,y) \quad \text{or} \quad \frac{\partial^2 f}{\partial y \partial x}.$$

Finally, the notation for the second partial with respect to y and then y again is

$$f_{yy}(x,y) \quad \text{or} \quad \frac{\partial^2 f}{\partial y^2}.$$
Taking a partial derivative is fairly straightforward. Take the derivative with respect to the indicated independent variable, treating all other independent variables as constants. All of the tricks and techniques discussed in Section 4.8 apply. The following set of examples will demonstrate this process by finding all of the first and second derivatives of a function $h(x,y) = (3x + 2)e^{xy}$.

Example 2. Find the first partial derivative with respect to x of $h(x,y) = (3x + 2)e^{xy}$.

To calculate the first partial derivative with respect to x, we must treat y as a constant. It isn't necessary to actually replace y, but if it helps, imagine that y = c in h(x,y). Then the function becomes $h(x) = (3x + 2)e^{cx}$, which requires the product rule and the chain rule to differentiate:

$$\begin{aligned}
h'(x) &= \frac{d}{dx}(3x + 2)\,e^{cx} + (3x + 2)\frac{d}{dx}(e^{cx}), \\
h'(x) &= 3e^{cx} + (3x + 2)e^{cx}\frac{d}{dx}(cx), \\
h'(x) &= 3e^{cx} + c(3x + 2)e^{cx}.
\end{aligned}$$

Now we replace c with y, and the first partial derivative with respect to x is

$$h_x(x,y) = 3e^{xy} + y(3x + 2)e^{xy}.$$
Example 3. Find the first partial derivative with respect to y of $h(x,y) = (3x + 2)e^{xy}$.

To calculate the first partial derivative with respect to y, x is treated as a constant. This expression turns out to be easier to differentiate since the product rule is not needed. For clarity, as in Example 2, we replace x with c to represent a constant:

$$h'(y) = (3c + 2)\frac{d}{dy}(e^{cy}) = c(3c + 2)e^{cy}.$$

Replacing c with x, the first partial derivative with respect to y is

$$h_y(x,y) = x(3x + 2)e^{xy}.$$
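Example 4. Find the second partial derivative with respect to x and then with respect to x of $h(x,y) = (3x + 2)e^{xy}$.

The second partial derivative with respect to x and x again is the first partial derivative of $h_x(x,y)$ with respect to x. In other words,

$$h_{xx}(x,y) = \frac{\partial}{\partial x} h_x(x,y).$$

So we just take the derivative of $h_x(x,y)$, which we derived in Example 2:

$$\begin{aligned}
h_{xx}(x,y) &= \frac{\partial}{\partial x}\left(3e^{xy} + y(3x + 2)e^{xy}\right) \\
&= \frac{\partial}{\partial x}\left(3e^{xy}\right) + \frac{\partial}{\partial x}\left(y(3x + 2)e^{xy}\right) \\
&= 3ye^{xy} + y\frac{\partial}{\partial x}(3x + 2)\,e^{xy} + y(3x + 2)\frac{\partial}{\partial x}(e^{xy}) \\
&= 3ye^{xy} + 3ye^{xy} + (3x + 2)y^2e^{xy} \\
&= (3xy^2 + 2y^2 + 6y)e^{xy}.
\end{aligned}$$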
Example 5. Find the second partial derivative with respect to y and then with respect to y of $h(x,y) = (3x + 2)e^{xy}$.

The second partial derivative with respect to y and y again is the first partial derivative of $h_y(x,y)$ with respect to y. In other words,

$$h_{yy}(x,y) = \frac{\partial}{\partial y} h_y(x,y).$$

So we just take the derivative of $h_y(x,y)$, which we derived in Example 3:

$$h_{yy}(x,y) = \frac{\partial}{\partial y}\left(x(3x + 2)e^{xy}\right) = x^2(3x + 2)e^{xy}.$$
Example 6. Find the second partial derivative with respect to x and then with respect to y of $h(x,y) = (3x + 2)e^{xy}$.

The second partial derivative with respect to x and then y is the first partial derivative of $h_x(x,y)$ with respect to y. In other words,

$$h_{xy}(x,y) = \frac{\partial}{\partial y} h_x(x,y).$$

So we just take the derivative with respect to y of $h_x(x,y)$, which we derived in Example 2. Because x is treated as a constant, the chain rule gives $\frac{\partial}{\partial y}(e^{xy}) = xe^{xy}$:

$$\begin{aligned}
h_{xy}(x,y) &= \frac{\partial}{\partial y}\left(3e^{xy} + y(3x + 2)e^{xy}\right) \\
&= 3\frac{\partial}{\partial y}(e^{xy}) + (3x + 2)\frac{\partial}{\partial y}(ye^{xy}) \\
&= 3xe^{xy} + (3x + 2)(e^{xy} + xye^{xy}) \\
&= (3x^2y + 2xy + 6x + 2)e^{xy}.
\end{aligned}$$
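The claim that the order of differentiation does not matter for a mixed second partial derivative can be checked numerically. The sketch below assumes the running example is h(x,y) = (3x + 2)e^{xy}, which is the form consistent with the slopes 96.1 and 36.9 reported in Section 7.3.2. It codes the two first partial derivatives from Examples 2 and 3, then differentiates each one numerically with respect to the other variable. The helper names central_dx and central_dy are hypothetical.

```python
import math


def h_x(x, y):
    # First partial with respect to x, assuming h(x, y) = (3x + 2) * exp(xy)
    return 3 * math.exp(x * y) + y * (3 * x + 2) * math.exp(x * y)


def h_y(x, y):
    # First partial with respect to y under the same assumption
    return x * (3 * x + 2) * math.exp(x * y)


def central_dx(g, x, y, step=1e-4):
    """Central difference quotient in x with y held constant."""
    return (g(x + step, y) - g(x - step, y)) / (2 * step)


def central_dy(g, x, y, step=1e-4):
    """Central difference quotient in y with x held constant."""
    return (g(x, y + step) - g(x, y - step)) / (2 * step)


h_xy = central_dy(h_x, 1.0, 2.0)  # differentiate in x first, then y
h_yx = central_dx(h_y, 1.0, 2.0)  # differentiate in y first, then x
print(h_xy, h_yx)
```

Both routes should agree with each other and with the closed form (3x²y + 2xy + 6x + 2)e^{xy}, which equals 18e² at the point (1, 2).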
There is no example for finding the second derivative of h(x,y) with respect first to y and then to x. The reason is that this derivative would be exactly equal to the answer we calculated in Example 6 by taking the derivative with respect to x and then y. In general, the order of differentiation does not change a mixed second partial derivative. A theorem tells us that the second partial derivative with respect to x and then y is the same as the second partial derivative with respect to y and then x.

Theorem: For any function f(x,y) whose second partial derivatives exist and are continuous, $f_{xy}(x,y) = f_{yx}(x,y)$.
Proof: $f_{xy}(x,y)$ is the partial derivative with respect to y of $f_x(x,y)$:

$$\begin{aligned}
f_{xy}(x,y) &= \lim_{y \to b} \frac{f_x(x,y) - f_x(x,b)}{y - b} \\
&= \lim_{y \to b} \frac{\lim_{x \to a} \frac{f(x,y) - f(a,y)}{x - a} - \lim_{x \to a} \frac{f(x,b) - f(a,b)}{x - a}}{y - b} \\
&= \lim_{(x,y) \to (a,b)} \frac{f(x,y) - f(a,y) - f(x,b) + f(a,b)}{(x - a)(y - b)}.
\end{aligned}$$

$f_{yx}(x,y)$ is the partial derivative with respect to x of $f_y(x,y)$:

$$\begin{aligned}
f_{yx}(x,y) &= \lim_{x \to a} \frac{f_y(x,y) - f_y(a,y)}{x - a} \\
&= \lim_{x \to a} \frac{\lim_{y \to b} \frac{f(x,y) - f(x,b)}{y - b} - \lim_{y \to b} \frac{f(a,y) - f(a,b)}{y - b}}{x - a} \\
&= \lim_{(x,y) \to (a,b)} \frac{f(x,y) - f(a,y) - f(x,b) + f(a,b)}{(x - a)(y - b)}.
\end{aligned}$$
Since fxy (x,y) and fyx (x,y) evaluate to the same expression, they are equal. Example 7. Show that kxy (x,y) = kyx (x,y) for the function k(x,y) = ln(xy + x − y). The first partial derivative with respect to x is kx (x,y) =
+ x − y) y+ = . xy + x − y xy + x − y
∂ (xy ∂x
The derivative of this result with respect to y requires the quotient rule: kxy (x,y) =
∂ ∂ (y + ) − (y + ) ∂y (xy + x − y) (xy + x − y) ∂y
(xy + x − y) xy + x − y − (y + )(x − ) = (xy + x − y) xy + x − y − xy − x + y + = (xy + x − y) =
.
(xy + x − y)
The first partial derivative with respect to y is ky (x,y) =
∂ (xy ∂y
+ x − y)
xy + x − y x− = . xy + x − y
The derivative of this result with respect to x also requires the quotient rule: ∂ ∂ (xy + x − y) ∂x (x − ) − (x − ) ∂x (xy + x − y) (xy + x − y) xy + x − y − (x − )(y + ) = (xy + x − y) xy + x − y − xy − x + y + = (xy + x − y)
kyx (x,y) =
=
(xy + x − y)
.
So kxy (x,y) = kyx (x,y).
7.3.2 Gradients and Hessians

The first partial derivatives of a function can be written inside a vector, and the second partial derivatives of a function can be written inside a matrix. Matrices and vectors are the focus of the third part of this book, but they are briefly defined here. See Chapter 8 for a more detailed explanation.

Definition: Matrix
A two-dimensional array of objects where the row and column positions of each object have meaning. The objects are often, but not necessarily, numbers.

The objects inside a matrix are called elements, and each element in a matrix has a position determined by its row and its column. Matrices are a lot like tables in that the position of an element has substantive meaning. For example, Table 7.1 expresses the number of Democrats, Republicans, and Independents in each house of Congress.

Table 7.1 Party Membership of the 113th Congress Expressed in a Table and in a Matrix

                            Democrats   Republicans   Independents
House of Representatives       201         233             0
Senate                          53          45             2

$$\begin{pmatrix} 201 & 233 & 0 \\ 53 & 45 & 2 \end{pmatrix}$$

This information can also be expressed in a matrix, where the row of each element represents a house and the column of each element represents a political party. The (2, 1) element and the (1, 2) element are not interchangeable because the number of Democrats in the Senate does not equal the number of Republicans in the House of Representatives. The values of the individual matrix elements matter, but the position of these elements also matters. A vector is a particular kind of matrix:

Definition: Vector
A matrix with only one row or only one column.
A vector with only one row is called a row vector, and a vector with only one column is called a column vector. Vectors and matrices are important tools for describing the derivatives of multidimensional functions. The analog of a first derivative for a multidimensional function is a vector called a gradient, and the analog of a second derivative is a matrix called the Hessian.

Definition: Gradient
A gradient is a vector of first partial derivatives of a multivariate function and is denoted with the symbol ∇. If g(x,y,z) is a differentiable function with three independent variables, then its gradient is given by

$$\nabla g = \begin{pmatrix} \dfrac{\partial g}{\partial x} \\[2ex] \dfrac{\partial g}{\partial y} \\[2ex] \dfrac{\partial g}{\partial z} \end{pmatrix}.$$
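Definition: Hessian
A Hessian is a matrix of second partial derivatives of a function. For the function g(x,y,z), the Hessian is denoted H(g). The Hessian matrix has three rows and three columns and is given by

$$H(g) = \begin{pmatrix}
\dfrac{\partial^2 g}{\partial x^2} & \dfrac{\partial^2 g}{\partial x \partial y} & \dfrac{\partial^2 g}{\partial x \partial z} \\[2ex]
\dfrac{\partial^2 g}{\partial y \partial x} & \dfrac{\partial^2 g}{\partial y^2} & \dfrac{\partial^2 g}{\partial y \partial z} \\[2ex]
\dfrac{\partial^2 g}{\partial z \partial x} & \dfrac{\partial^2 g}{\partial z \partial y} & \dfrac{\partial^2 g}{\partial z^2}
\end{pmatrix}.$$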
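Think of the Hessian matrix as a table where the rows and the columns represent the different independent variables of the function. The first row and first column both refer to x, the second row and column refer to y, and the third row and column refer to z. The (1, 1) element of a Hessian contains the second partial derivative of the function with respect to x and x again. The (1, 2) element contains the second partial derivative with respect to x then y, and so on. Remember that second partial derivatives do not depend on the order of differentiation, so we can rewrite the Hessian as

$$H(g) = \begin{pmatrix}
\dfrac{\partial^2 g}{\partial x^2} & \dfrac{\partial^2 g}{\partial x \partial y} & \dfrac{\partial^2 g}{\partial x \partial z} \\[2ex]
\dfrac{\partial^2 g}{\partial x \partial y} & \dfrac{\partial^2 g}{\partial y^2} & \dfrac{\partial^2 g}{\partial y \partial z} \\[2ex]
\dfrac{\partial^2 g}{\partial x \partial z} & \dfrac{\partial^2 g}{\partial y \partial z} & \dfrac{\partial^2 g}{\partial z^2}
\end{pmatrix}.$$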
Since the second partial derivatives do not depend on the order of differentiation, every Hessian matrix is symmetric. That is, every element in the ith row and jth column is equal to the element in the jth row and ith column.

Example 1. Find the gradient and Hessian of $h(x,y) = (3x + 2)e^{xy}$, and use this information to describe the first and second partial derivatives of the function at (x,y) = (1, 2).

We previously derived all of the first and second derivatives for this function in Examples 2 through 6 in Section 7.3.1. So the gradient of h(x,y) is the vector of the first partial derivatives:

$$\nabla h(x,y) = \begin{pmatrix} 3e^{xy} + y(3x + 2)e^{xy} \\ x(3x + 2)e^{xy} \end{pmatrix}.$$

At the point (x,y) = (1, 2), the gradient is

$$\nabla h(1,2) = \begin{pmatrix} 13e^2 \\ 5e^2 \end{pmatrix} \approx \begin{pmatrix} 96.1 \\ 36.9 \end{pmatrix}.$$

So at the point (x,y) = (1, 2), the function has a slope of 96.1 along the x-axis and a slope of 36.9 along the y-axis. The Hessian of the function is

$$H(h(x,y)) = \begin{pmatrix} (3xy^2 + 2y^2 + 6y)e^{xy} & (3x^2y + 2xy + 6x + 2)e^{xy} \\ (3x^2y + 2xy + 6x + 2)e^{xy} & x^2(3x + 2)e^{xy} \end{pmatrix},$$

and at (x,y) = (1, 2) the Hessian becomes

$$H(h(1,2)) = \begin{pmatrix} 32e^2 & 18e^2 \\ 18e^2 & 5e^2 \end{pmatrix} \approx \begin{pmatrix} 236.4 & 133.0 \\ 133.0 & 36.9 \end{pmatrix}.$$

At this point, the slope on the x-axis is increasing by 236.4 units for every unit increase in x and by 133.0 units for every unit increase in y. The slope on the y-axis is increasing 133.0 units for every unit increase in x and 36.9 units for every unit increase in y. Clearly this function is increasing rapidly in all directions at the point (x,y) = (1, 2).
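A gradient and a Hessian can also be approximated numerically for a function with any number of independent variables. The sketch below is illustrative only: the helper names gradient and hessian are hypothetical, the method is ordinary central differencing rather than anything prescribed in this chapter, and the function h is assumed to be (3x + 2)e^{xy}, evaluated at the point (1, 2).

```python
import math


def gradient(f, point, step=1e-5):
    """Approximate the vector of first partial derivatives at `point`."""
    grad = []
    for i in range(len(point)):
        up, down = list(point), list(point)
        up[i] += step
        down[i] -= step
        grad.append((f(*up) - f(*down)) / (2 * step))
    return grad


def hessian(f, point, step=1e-4):
    """Approximate the matrix of second partial derivatives at `point`."""
    n = len(point)
    H = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                # Move variable i by di and variable j by dj (they may coincide).
                p = list(point)
                p[i] += di
                p[j] += dj
                return f(*p)
            H[i][j] = (shifted(step, step) - shifted(step, -step)
                       - shifted(-step, step) + shifted(-step, -step)) / (4 * step**2)
    return H


def h(x, y):
    # Assumed form of the example function
    return (3 * x + 2) * math.exp(x * y)


grad_h = gradient(h, (1.0, 2.0))
hess_h = hessian(h, (1.0, 2.0))
print(grad_h)
print(hess_h)
```

The printed values should be close to the exact entries 13e² and 5e² for the gradient and 32e², 18e², and 5e² for the Hessian, and the two off-diagonal entries of the approximate Hessian should match.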
Example 2. Find the gradient and Hessian matrix of f(x,y,z) = x + yz − xyz . The first partial derivatives are fx (x,y,z) = x − yz , fy (x,y,z) = z − xz , fz (x,y,z) = y − Therefore, the gradient is
xyz.
∇f(x,y,z) =
x − yz
.
z − xz y−
xyz
There are nine second partial derivatives, but we only need to find six of them, since the other three are repeated in the matrix. The second partial derivatives are fxx (x,y,z) = ,
fxy (x,y,z) = − z ,
fyy (x,y,z) = ,
fyz (x,y,z) =
Therefore, the Hessian matrix is
−
fxz (x,y,z) = − yz, xz,
fzz (x,y,z) = − xy.
( ) H f(x,y,z) = − z − yz
− z
− yz −
−
xz
. xz
− xy
7.3.3 Optimization

Optimization in multiple dimensions is analogous to optimization in two dimensions, as discussed in Section 5.2. In two dimensions, we find critical points by setting the derivative equal to 0, and we can check to see whether these points are maxima or minima using the second derivative test. For a multidimensional function, we have to find the points at which the slope of the function is 0 in every direction, and we can check critical points by looking at the second partial derivatives. The multidimensional analog of a first derivative is a gradient. The first step to find the maxima and minima of a multivariate function is to find the critical points of the function.

Definition: Critical point (of a multivariate function)
A critical point is a set of values for the independent variables of a multivariate function such that every element of the gradient is equal to 0. For example, let f(x,y,z) be a function with three independent variables. If (x,y,z) = (a,b,c) is a critical point of the function, then the gradient at this point must be

$$\nabla f(a,b,c) = \begin{pmatrix} f_x(a,b,c) \\ f_y(a,b,c) \\ f_z(a,b,c) \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.$$
To confirm that a critical point for a function with one independent variable is a relative minimum, for example, the next step is to check that the second derivative is positive. But the multidimensional analog of a second derivative is a Hessian matrix, and there is no concept of positive and negative that applies to an entire matrix. A matrix, however, can be positive-definite or negative-definite, and these concepts are loosely analogous to positive and negative for a unidimensional second derivative. When a matrix is symmetric (that is, every element is equal to the element that switches the row and column coordinates, as is true of every Hessian matrix), there are rules for determining whether the matrix is positive-definite, negative-definite, or neither. These rules, discussed in greater detail in Section 10.4.2, are manageable for (2 × 2) symmetric matrices but get complicated very quickly for larger matrices. Here, I present the methodology for determining whether a (2 × 2) matrix is positive-definite, negative-definite, or neither.

Definition: Positive-definite and negative-definite for a (2 × 2) Hessian matrix
Suppose that a function f(x,y) has a critical point at (x,y) = (a,b). The Hessian at this point is

$$H(f(a,b)) = \begin{pmatrix} f_{xx}(a,b) & f_{xy}(a,b) \\ f_{xy}(a,b) & f_{yy}(a,b) \end{pmatrix}.$$

If the Hessian is positive-definite or negative-definite, then the following condition must be true:

$$f_{xx}(a,b)\,f_{yy}(a,b) - \left(f_{xy}(a,b)\right)^2 > 0.$$

If this condition is not met, then the matrix is neither positive-definite nor negative-definite. If this condition is true, then if

$$f_{xx}(a,b) + f_{yy}(a,b) > 0,$$

the matrix is positive-definite, and if

$$f_{xx}(a,b) + f_{yy}(a,b) < 0,$$

the matrix is negative-definite. If the matrix is symmetric and the first condition is true, then it cannot be true that $f_{xx}(a,b) + f_{yy}(a,b) = 0$.
To find the extreme points of a multidimensional function, we perform the following steps:

Optimization in multiple dimensions: For an n-dimensional, differentiable function $f(x_1, x_2, \ldots, x_n)$, do the following:

1. Find the gradient ∇f.
2. Set each element of the gradient vector to 0, and solve the system of n equations and n unknowns. The n-dimensional solutions are the critical points.
3. Find the Hessian matrix H(f), and plug in the critical points:

• If the Hessian matrix is positive-definite at the critical point, then the critical point represents a local minimum.

• If the Hessian matrix is negative-definite at the critical point, then the critical point represents a local maximum.

• If the Hessian matrix is neither positive-definite nor negative-definite at the critical point, then the critical point typically represents one of the many bizarre-looking saddle points that are possible in multiple dimensions. (In the borderline case of a (2 × 2) Hessian for which $f_{xx}f_{yy} - (f_{xy})^2$ is exactly 0, the second partial derivatives alone do not settle the question.)
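For a function of two variables, the definiteness check in step 3 reduces to two arithmetic tests on the second partial derivatives. The following is a minimal sketch of that check. The function name classify_critical_point and the label "inconclusive" for a determinant of exactly 0 are assumptions added for illustration, since the second-derivative information cannot classify that borderline case.

```python
def classify_critical_point(fxx, fxy, fyy):
    """Classify a critical point of f(x, y) from its (2 x 2) Hessian entries."""
    det = fxx * fyy - fxy**2
    if det > 0:
        # Definite case: the sign of fxx + fyy separates minimum from maximum.
        if fxx + fyy > 0:
            return "local minimum"   # positive-definite Hessian
        return "local maximum"       # negative-definite Hessian
    if det < 0:
        return "saddle point"        # neither positive- nor negative-definite
    return "inconclusive"            # det == 0: the test does not decide


# Hypothetical Hessian entries chosen only to exercise each branch
print(classify_critical_point(2.0, 0.0, 2.0))    # bowl-shaped surface
print(classify_critical_point(-2.0, 0.0, -2.0))  # dome-shaped surface
print(classify_critical_point(1.0, 3.0, 1.0))    # saddle
```

When the first condition holds, fxx and fyy must share the same sign, so checking the sign of their sum is enough to separate the two definite cases.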
Let's maximize, for example, the bivariate normal distribution N(3, 4, 1, 1, .5). That is, this distribution is the joint probability distribution of two normal random variables x and y. The mean of x is 3, the mean of y is 4, the standard deviation of both variables is 1, and the correlation between the two variables is .5. This distribution has the functional form

$$f(x,y) = \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}\left(x^2 + y^2 - xy - 2x - 5y + 13\right)}.$$

Since a normal distribution should be maximized at its mean, this distribution should have a local maximum at (3, 4). But to prove that the distribution has a maximum point there, we can find the values of x and y that make the gradient equal to 0. Then we can confirm that this critical point is a maximum by checking the Hessian matrix. To find the partial derivative with respect to x, treat y as a constant and use the chain rule:

$$\begin{aligned}
f_x(x,y) &= \frac{\sqrt{3}}{3\pi}\,\frac{\partial}{\partial x}\left(e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\right) \\
&= \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\,\frac{\partial}{\partial x}\left(-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)\right) \\
&= \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2x - y - 2)\right).
\end{aligned}$$

Similarly, the partial derivative with respect to y is

$$f_y(x,y) = \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2y - x - 5)\right).$$

Therefore, the gradient is

$$\nabla f(x,y) = \begin{pmatrix}
\frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2x - y - 2)\right) \\[1.5ex]
\frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2y - x - 5)\right)
\end{pmatrix}.$$
We want to find the values of x and y that make both partial derivatives simultaneously equal to 0. For both of these equations, the front factor $\frac{\sqrt{3}}{3\pi}$, the exponential term, and $-\frac{2}{3}$ are never equal to 0, so we can divide both sides of each equation by these factors, which cancel out. The remaining system of equations to solve is

$$\begin{aligned} 2x - y - 2 &= 0, \\ 2y - x - 5 &= 0. \end{aligned}$$

Solving the top equation for y, we see that

$$y = 2x - 2.$$

Substituting for y in the second equation gives us

$$\begin{aligned} 2(2x - 2) - x - 5 &= 0, \\ 4x - 4 - x - 5 &= 0, \\ 3x - 9 &= 0, \\ x &= 3. \end{aligned}$$

Since x = 3,

$$y = 2(3) - 2 = 4.$$
The point (3, 4) makes all partial derivatives simultaneously equal to 0; therefore, (3, 4) is a critical point. To find the second partial derivative with respect to x and x again, we start with $f_x(x,y)$, treat y as a constant, and take the derivative using the product rule:

$$\begin{aligned}
f_{xx}(x,y) &= \frac{\sqrt{3}}{3\pi}\,\frac{\partial}{\partial x}\left[e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2x - y - 2)\right)\right] \\
&= \frac{\sqrt{3}}{3\pi}\left[e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2x - y - 2)\right)^2 + e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{4}{3}\right)\right] \\
&= \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(\frac{4}{9}(2x - y - 2)^2 - \frac{4}{3}\right).
\end{aligned}$$

By the same calculations, the second partial derivative with respect to y and y again is

$$f_{yy}(x,y) = \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(\frac{4}{9}(2y - x - 5)^2 - \frac{4}{3}\right).$$

The second partial derivative with respect to x and then y is

$$\begin{aligned}
f_{xy}(x,y) &= \frac{\sqrt{3}}{3\pi}\,\frac{\partial}{\partial y}\left[e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(-\frac{2}{3}(2x - y - 2)\right)\right] \\
&= \frac{\sqrt{3}}{3\pi}\left[e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\,\frac{4}{9}(2x - y - 2)(2y - x - 5) + e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(\frac{2}{3}\right)\right] \\
&= \frac{\sqrt{3}}{3\pi}\, e^{-\frac{2}{3}(x^2 + y^2 - xy - 2x - 5y + 13)}\left(\frac{4}{9}(2x - y - 2)(2y - x - 5) + \frac{2}{3}\right).
\end{aligned}$$

At the point (3, 4), the exponent equals 0 and both linear factors vanish, so the second derivatives evaluate to

$$f_{xx}(3,4) = -0.245, \qquad f_{yy}(3,4) = -0.245, \qquad f_{xy}(3,4) = 0.123.$$

So the Hessian is

$$H(f(3,4)) = \begin{pmatrix} f_{xx}(3,4) & f_{xy}(3,4) \\ f_{xy}(3,4) & f_{yy}(3,4) \end{pmatrix} = \begin{pmatrix} -0.245 & 0.123 \\ 0.123 & -0.245 \end{pmatrix}.$$
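The evaluations at (3, 4) can be reproduced with a short script. This is a sketch under the stated parameters of the example (means 3 and 4, unit standard deviations, correlation .5), with the density and its derivatives written out in closed form. The function names and the helper q for the quadratic in the exponent are hypothetical.

```python
import math

C = math.sqrt(3) / (3 * math.pi)  # leading constant of the density


def q(x, y):
    # Quadratic form that appears inside the exponent
    return x**2 + y**2 - x * y - 2 * x - 5 * y + 13


def f(x, y):
    return C * math.exp(-(2 / 3) * q(x, y))


def f_x(x, y):
    return f(x, y) * (-(2 / 3) * (2 * x - y - 2))


def f_y(x, y):
    return f(x, y) * (-(2 / 3) * (2 * y - x - 5))


def f_xx(x, y):
    return f(x, y) * ((4 / 9) * (2 * x - y - 2) ** 2 - 4 / 3)


def f_yy(x, y):
    return f(x, y) * ((4 / 9) * (2 * y - x - 5) ** 2 - 4 / 3)


def f_xy(x, y):
    return f(x, y) * ((4 / 9) * (2 * x - y - 2) * (2 * y - x - 5) + 2 / 3)


# Both gradient elements vanish at the candidate critical point (3, 4).
print(f_x(3, 4), f_y(3, 4))

# Quantities used to judge definiteness of the Hessian at (3, 4)
det = f_xx(3, 4) * f_yy(3, 4) - f_xy(3, 4) ** 2
trace = f_xx(3, 4) + f_yy(3, 4)
print(round(f_xx(3, 4), 3), round(f_xy(3, 4), 3), round(det, 3), round(trace, 3))
```

The determinant-style quantity works out to 4/(9π²), roughly 0.045, and the sum of the diagonal entries is negative.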
To test whether the critical point refers to a maximum, a minimum, or a saddle point, we need to see whether the quantity $f_{xx}(3,4)\,f_{yy}(3,4) - \left(f_{xy}(3,4)\right)^2$ is positive. In this case, the quantity evaluates to 0.045. So this matrix is either positive-definite or negative-definite. The second condition is $f_{xx}(3,4) + f_{yy}(3,4)$, which evaluates to −0.490. Since this quantity is negative, the Hessian matrix of f is negative-definite at the point (3, 4), so the function has a local maximum at the point (3, 4).

Example 1. Find the critical points of f(x,y) = x y − x − xy +
,
and determine whether each critical point represents a local maximum, local minimum, or saddle point. The first step is to find and identify the critical points of the function. The gradient of f(x,y) is x y − x − y x y − x − y = . ∇f(x,y) = x − x x(x − )
The partial derivative with respect to y is equal to 0 when x = , x = − , or x = . If x = , then the partial derivative with respect to x is 0 when y = . Likewise, if x = − , then y should be − , and if x = , then y = . Therefore ( , ), (− , − ), and ( , ) are the critical points of the function. To check whether the critical points represent local maximum, minimum, or saddle points, we have to analyze the Hessian matrix. The Hessian in this case is ( ) xy − x − . H f(x,y) = x − At the critical points, the Hessian reduces to − − , H(f(− , − )) = H(f( , )) = −
,
H(f( , )) =
.
Next we check the two conditions discussed in Section 7.3.3 to see if these points represent local maxima, local minima, or saddle points. The first condition we check is whether fxx (a,b)fyy (a,b) − (fxy (a,b)) > . For the three critical points, this quantity evaluates to fxx ( , ) fyy ( , ) − (fxy ( , )) = −
× − (− ) × − = − ,
fxx (− , − ) fyy (− , − ) − (fxy (− , − )) =
× − ×
=− ,
fxx ( , ) fyy ( , ) − (fxy ( , )) =
× − ×
=− .
None of the critical points meet the first criterion for local maxima and minima, so these critical points are saddle points.
7.3.4 Finding the Best-Fit Line for Linear Regression

This section builds on the discussion of linear regression in Section 2.5.3. Please read the earlier discussion before reading this one. There are plenty of equations here that can appear to be intimidating, but please read the section carefully. This section uses very little math beyond algebra and very simple partial differentiation.

In Section 2.5.3, I introduced the concepts of data, scatterplots, best-fit lines, and residuals. Recall that a simple regression considers the relationship between one dependent y variable and only one independent x variable. The equation of the best-fit line produced by a simple regression is

$$\hat{y}_i = \alpha + \beta x_i,$$

where α is the y-intercept of the best-fit line, β is the slope of the best-fit line, and $\hat{y}_i$ is the predicted value of y implied by the best-fit line for a particular observation i. The goal of linear regression is to find the values of α and β that create the best-fit line as opposed to any other line. The residuals $\epsilon_i$ are the differences between the y-value of the data point and the y-value of the best-fit line, so that the dependent variable $y_i$ is modeled by

$$y_i = \alpha + \beta x_i + \epsilon_i.$$

The best-fit line, by definition, is the line that minimizes the sum of squared residuals (SSR):

$$SSR = \sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} (y_i - \alpha - \beta x_i)^2.$$
To minimize the SSR, we use the methods for finding the local minima of functions with two independent variables, discussed in Section 7.3.3. In this case, the two independent variables are α and β.³ We find the gradient by taking the partial derivative of the SSR with respect to α and the partial derivative of the SSR with respect to β. Then we set both partial derivatives equal to 0 and solve for α and β. Finally, we confirm that this solution is in fact a local minimum.

Before we take any derivatives, we can simplify the expression a bit. If we multiply the square through the expression within the sum, we get

$$\begin{aligned}
SSR &= \sum_{i=1}^{N} \left(y_i^2 - \alpha y_i - \beta x_i y_i - \alpha y_i + \alpha^2 + \alpha\beta x_i - \beta x_i y_i + \alpha\beta x_i + \beta^2 x_i^2\right) \\
&= \sum_{i=1}^{N} \left(y_i^2 + \alpha^2 + \beta^2 x_i^2 - 2\alpha y_i - 2\beta x_i y_i + 2\alpha\beta x_i\right).
\end{aligned}$$

We can rewrite this expression by distributing the summation sign and moving the constants and the α and β parameters outside each sum:

$$\begin{aligned}
SSR &= \sum_{i=1}^{N} (y_i^2) + \sum_{i=1}^{N} (\alpha^2) + \sum_{i=1}^{N} (\beta^2 x_i^2) - \sum_{i=1}^{N} (2\alpha y_i) - \sum_{i=1}^{N} (2\beta x_i y_i) + \sum_{i=1}^{N} (2\alpha\beta x_i) \\
&= \sum_{i=1}^{N} y_i^2 + N\alpha^2 + \beta^2 \sum_{i=1}^{N} x_i^2 - 2\alpha \sum_{i=1}^{N} y_i - 2\beta \sum_{i=1}^{N} x_i y_i + 2\alpha\beta \sum_{i=1}^{N} x_i.
\end{aligned}$$

The partial derivative of SSR with respect to α is

$$\frac{\partial(SSR)}{\partial \alpha} = 2N\alpha - 2\sum_{i=1}^{N} y_i + 2\beta \sum_{i=1}^{N} x_i,$$

and the partial derivative of SSR with respect to β is

$$\frac{\partial(SSR)}{\partial \beta} = 2\beta \sum_{i=1}^{N} x_i^2 - 2\sum_{i=1}^{N} x_i y_i + 2\alpha \sum_{i=1}^{N} x_i.$$
3 Be careful: x and y represent data in this example. Since we know the values of our data points, these are not the quantities we need to estimate. Even though x and y are called “variables” from the perspective of data analysis, the actual variables for the purpose of optimization are α and β.
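To find the critical points, we set each partial derivative equal to 0 and solve the system of equations. First, we set the partial derivative with respect to α equal to 0 and solve for α:

$$\begin{aligned}
2N\alpha - 2\sum_{i=1}^{N} y_i + 2\beta \sum_{i=1}^{N} x_i &= 0, \\
2N\alpha &= 2\sum_{i=1}^{N} y_i - 2\beta \sum_{i=1}^{N} x_i, \\
\alpha &= \frac{\sum_{i=1}^{N} y_i}{N} - \beta\,\frac{\sum_{i=1}^{N} x_i}{N}.
\end{aligned}$$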
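Notice that the fractions $\sum y_i/N$ and $\sum x_i/N$ are just the average values of y and x, respectively. We can denote these fractions as $\bar{y}$ and $\bar{x}$ and rewrite the solution for α as

$$\alpha = \bar{y} - \beta\bar{x}.$$

Next, we set the partial derivative with respect to β equal to 0,

$$2\beta \sum_{i=1}^{N} x_i^2 - 2\sum_{i=1}^{N} x_i y_i + 2\alpha \sum_{i=1}^{N} x_i = 0,$$

and plug the value we just derived for α into the equation:

$$2\beta \sum_{i=1}^{N} x_i^2 - 2\sum_{i=1}^{N} x_i y_i + 2(\bar{y} - \beta\bar{x}) \sum_{i=1}^{N} x_i = 0.$$

We can distribute the $2\sum x_i$ to each term in the parentheses,

$$2\beta \sum_{i=1}^{N} x_i^2 - 2\sum_{i=1}^{N} x_i y_i + 2\bar{y} \sum_{i=1}^{N} x_i - 2\beta\bar{x} \sum_{i=1}^{N} x_i = 0.$$

Then we bring every term containing β to one side of the equation and cancel out the 2s in front of every term:

$$\begin{aligned}
\beta \sum_{i=1}^{N} x_i^2 - \beta\bar{x} \sum_{i=1}^{N} x_i &= \sum_{i=1}^{N} x_i y_i - \bar{y} \sum_{i=1}^{N} x_i, \\
\beta \left(\sum_{i=1}^{N} x_i^2 - \bar{x} \sum_{i=1}^{N} x_i\right) &= \sum_{i=1}^{N} x_i y_i - \bar{y} \sum_{i=1}^{N} x_i.
\end{aligned}$$

Solving for β, we get

$$\beta = \frac{\sum_{i=1}^{N} x_i y_i - \bar{y} \sum_{i=1}^{N} x_i}{\sum_{i=1}^{N} x_i^2 - \bar{x} \sum_{i=1}^{N} x_i}.$$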
This equation is one correct way to write the formula for β, but it isn't the most famous way to write the formula. We will rewrite this equation by first observing that the sum of a variable, $\sum x_i$, is equal to N times the average of the variable, $N\bar{x}$. We make these substitutions and get

$$\beta = \frac{\sum_{i=1}^{N} x_i y_i - N\bar{x}\bar{y}}{\sum_{i=1}^{N} x_i^2 - N\bar{x}^2}.$$

This equation is the second most famous way to write the formula for β. To arrive at the most famous way to write the formula, we have to employ some algebraic tricks. First, we can rewrite the numerator

$$\sum_{i=1}^{N} x_i y_i - N\bar{x}\bar{y}$$

by adding and subtracting $N\bar{x}\bar{y}$,

$$\sum_{i=1}^{N} x_i y_i - N\bar{x}\bar{y} + N\bar{x}\bar{y} - N\bar{x}\bar{y},$$

and combining the second and fourth terms:

$$\sum_{i=1}^{N} x_i y_i - 2N\bar{x}\bar{y} + N\bar{x}\bar{y}.$$

We break the second term into two halves,

$$\sum_{i=1}^{N} x_i y_i - N\bar{x}\bar{y} - N\bar{x}\bar{y} + N\bar{x}\bar{y},$$

replace $N\bar{x}$ with $\sum x_i$ in the second term, and replace $N\bar{y}$ with $\sum y_i$ in the third term:

$$\sum_{i=1}^{N} x_i y_i - \bar{y} \sum_{i=1}^{N} x_i - \bar{x} \sum_{i=1}^{N} y_i + N\bar{x}\bar{y}.$$

We can bring $\bar{x}$ and $\bar{y}$ inside the summations and replace $N\bar{x}\bar{y}$ with $\sum \bar{x}\bar{y}$,

$$\sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} \bar{y}x_i - \sum_{i=1}^{N} \bar{x}y_i + \sum_{i=1}^{N} \bar{x}\bar{y},$$

and then combine all the terms within one summation:

$$\sum_{i=1}^{N} (x_i y_i - \bar{y}x_i - \bar{x}y_i + \bar{x}\bar{y}).$$

The terms inside the parentheses can now be factored:

$$\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}).$$

Now consider the denominator of the previous formula for β:

$$\sum_{i=1}^{N} x_i^2 - N\bar{x}^2.$$
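Add and subtract $N\bar{x}^2$,

$$\sum_{i=1}^{N} x_i^2 - N\bar{x}^2 + N\bar{x}^2 - N\bar{x}^2,$$

and combine the second and fourth terms:

$$\sum_{i=1}^{N} x_i^2 - 2N\bar{x}^2 + N\bar{x}^2.$$

The second term can be rewritten as the product of two $\bar{x}$ terms,

$$\sum_{i=1}^{N} x_i^2 - 2N\bar{x}\bar{x} + N\bar{x}^2,$$

and we can substitute $N\bar{x}$ with $\sum x_i$ in the second term:

$$\sum_{i=1}^{N} x_i^2 - 2\bar{x} \sum_{i=1}^{N} x_i + N\bar{x}^2.$$

Next, we can bring the $2\bar{x}$ coefficient inside the summation, and we can rewrite $N\bar{x}^2$ as $\sum \bar{x}^2$,

$$\sum_{i=1}^{N} x_i^2 - \sum_{i=1}^{N} 2\bar{x}x_i + \sum_{i=1}^{N} \bar{x}^2,$$

and combine the terms within one summation,

$$\sum_{i=1}^{N} (x_i^2 - 2\bar{x}x_i + \bar{x}^2),$$

which factors to

$$\sum_{i=1}^{N} (x_i - \bar{x})^2.$$

So the entire formula for β can be written as

$$\beta = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}.$$

This equation is the most famous and user-friendly formula for β. We summarize as follows:

The best-fit line for a simple regression is $\hat{y}_i = \hat{\alpha} + \hat{\beta}x_i$, where

$$\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x} \quad \text{and} \quad \hat{\beta} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}.$$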
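The two summary formulas translate directly into a few lines of code. The following sketch is illustrative: the function name fit_line and the four data points are hypothetical, with the points chosen to lie exactly on the line y = 1 + 2x so that the expected intercept and slope are known in advance.

```python
def fit_line(xs, ys):
    """Return (alpha_hat, beta_hat) for the best-fit line of a simple regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: sum of squared deviations of x from its mean
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


# Hypothetical data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
alpha_hat, beta_hat = fit_line(xs, ys)
print(alpha_hat, beta_hat)
```

The slope formula divides by the sum of squared deviations of x, so it is undefined when every x value is identical.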
We can intuitively confirm that this critical point represents a local minimum instead of a local maximum or a saddle point.⁴ The best possible value for the SSR is 0, and this value would indicate that every data point lies exactly on the best-fit line. It is impossible for the SSR to be negative since it is the sum of squared values. On the other hand, there's no upper bound on the SSR. We can find worse-fitting lines for the data by increasing the y-intercept without bound. Therefore, the SSR can have a local minimum but not a local maximum. This critical point is, therefore, a local minimum. We can use these formulas to find the slope and y-intercept of the best-fit line for the data presented in Section 2.5.3.

Example 1. Use the formula for the slope and y-intercept of the best-fit line for data with one dependent variable and one independent variable to calculate the best-fit line for the example data presented in Section 2.5.3.

The data from Section 2.5.3 are as follows:

Observation   Left-Right Ideology   Obama Approval
     1              −2.50                 9.04
     2              −2.14                 1.97
     3              −1.79                 3.21
     4              −1.43                 8.07
     5              −1.07                 4.38
     6              −0.71                 0.74
     7              −0.36                −1.71
     8               0.00                −2.20
     9               0.36                 2.94
    10               0.71                −4.06
    11               1.07                 1.99
    12               1.43                −5.18
    13               1.79                −4.22
    14               2.14                −5.92
    15               2.50               −10.89

In this case, the dependent variable $y_i$ is each respondent's level of approval of Barack Obama as president, and the independent variable $x_i$ is each respondent's placement on a left–right ideological scale.
4 It is also possible to prove this fact formally. The Hessian matrix for the ordinary least squares (OLS) estimator in two dimensions is

$$H(SSR) = \begin{pmatrix} 2N & 2\sum_{i=1}^{N} x_i \\ 2\sum_{i=1}^{N} x_i & 2\sum_{i=1}^{N} x_i^2 \end{pmatrix}.$$

A local minimum requires that the Hessian be positive-definite, and that proof requires that

$$(2N)\left(2\sum_{i=1}^{N} x_i^2\right) - \left(2\sum_{i=1}^{N} x_i\right)^2 > 0.$$

This fact can be proved using an application of Lagrange's identity or through mathematical induction on i. Both of these techniques, however, are well beyond the scope of this discussion.
To calculate the y-intercept α and the slope β of the best-fit line, we have to calculate the mean of $x_i$; the mean of $y_i$; the differences from the mean, $(x_i - \bar{x})$ and $(y_i - \bar{y})$; the product of these differences, $(x_i - \bar{x})(y_i - \bar{y})$; and the squared differences, $(x_i - \bar{x})^2$. Then we take the sum across i of $(x_i - \bar{x})(y_i - \bar{y})$ and $(x_i - \bar{x})^2$. I have performed these calculations in the table below:

  i      x_i      y_i    (x_i - x̄)  (y_i - ȳ)  (x_i - x̄)(y_i - ȳ)  (x_i - x̄)^2
  1     -2.50     9.04     -2.50       9.16          -22.90             6.25
  2     -2.14     1.97     -2.14       2.09           -4.47             4.58
  3     -1.79     3.21     -1.79       3.33           -5.96             3.20
  4     -1.43     8.07     -1.43       8.19          -11.71             2.04
  5     -1.07     4.38     -1.07       4.50           -4.82             1.14
  6     -0.71     0.74     -0.71       0.86           -0.61             0.50
  7     -0.36    -1.71     -0.36      -1.59            0.57             0.13
  8      0.00    -2.20      0.00      -2.08            0.00             0.00
  9      0.36     2.94      0.36       3.06            1.10             0.13
 10      0.71    -4.06      0.71      -3.94           -2.80             0.50
 11      1.07     1.99      1.07       2.11            2.26             1.14
 12      1.43    -5.18      1.43      -5.06           -7.24             2.04
 13      1.79    -4.22      1.79      -4.10           -7.34             3.20
 14      2.14    -5.92      2.14      -5.80          -12.41             4.58
 15      2.50   -10.89      2.50     -10.77          -26.93             6.25
Mean     0.00    -0.12
Sum                                                  -103.26            35.71

The slope of the best-fit line is given by
\[ \hat{\beta} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{N} (x_i - \bar{x})^2}, \]
which in this case is
\[ \hat{\beta} = \frac{-103.26}{35.71} = -2.89. \]
The y-intercept is given by $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$, which in this case is
\[ \hat{\alpha} = -0.12 + 2.89 \times 0 = -0.12. \]
So the best-fit line for the data is
\[ y_i = -0.12 - 2.89 x_i. \]
This line is graphed against the data in Section 2.5.3.
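The calculation in this example can also be checked by computer. The following is a minimal sketch in Python, using only the 15 data points printed in the table above; the names `ideology`, `approval`, and `simple_ols` are illustrative choices and do not come from the text.

```python
# Data as printed in the example (Section 2.5.3): 15 observations.
ideology = [-2.50, -2.14, -1.79, -1.43, -1.07, -0.71, -0.36, 0.00,
            0.36, 0.71, 1.07, 1.43, 1.79, 2.14, 2.50]
approval = [9.04, 1.97, 3.21, 8.07, 4.38, 0.74, -1.71, -2.20,
            2.94, -4.06, 1.99, -5.18, -4.22, -5.92, -10.89]


def simple_ols(x, y):
    """Return (alpha_hat, beta_hat) for the best-fit line y = alpha + beta * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of products of the deviations from the means.
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: sum of squared deviations of x from its mean.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat


alpha_hat, beta_hat = simple_ols(ideology, approval)
print(f"intercept = {alpha_hat:.2f}, slope = {beta_hat:.2f}")
# prints: intercept = -0.12, slope = -2.89
```

Because the function works from unrounded intermediate values, individual products may differ from a hand-built table in the last decimal place, but the rounded slope and intercept are the same.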
One note of caution: The formulas for α and β only apply to the case with one—and only one—independent variable. When multiple x variables are introduced, these formulas become much more complicated unless they are expressed with matrices. The formula for the best-fitting linear function in which there are many x variables is called the OLS estimator and is presented in Section 9.3 as a topic in linear algebra.
7.3.5 Lagrange Multipliers
Imagine again that you are standing on the surface of a three-dimensional function. This surface looks like a rolling range of hilltops. You stand on top of a hill and look down at all the peaks and valleys and twists and turns of the landscape down below. If the function has an unbounded domain, then the landscape will continue in all directions as far as you can see. But now imagine that a huge cylinder from outer space descends and cuts out a circle on the landscape, with you in the middle. This cylinder represents a bounded domain on the x and y dimensions. Suppose that the highest hilltop was a few miles away, very visible from where you stand, and that the cylinder cuts right through this mountain. The peak is cut off, but at the boundary, the mountain is still the highest point in the landscape. It is possible that the global maximum of a function in multiple dimensions exists right at the boundary.

To find the local maxima and minima of a function with one independent variable, we find the critical points and evaluate each one using the first or second derivative test. But to find global maxima or minima, we have to compare the value of the function at the critical points and the boundary points. The method of Lagrange multipliers is a way to consider the boundary points of a multidimensional function.

A function has boundary points when its domain is restricted. In two dimensions, a domain can be restricted to a particular range, such as a closed interval $x \in [a, b]$. But in multiple dimensions, the domain can be bounded in many more geometrically creative ways. Instead of a space cylinder, imagine an armada of evil space cookie cutters attacking the Earth. These boundaries can have many kinds of shapes. The shape of the boundary is given by an equation called a constraint. A constraint is expressed as a level curve of a different function with the same number of dimensions. Remember that a level curve is the set of values of the independent variables that make the function equal to some set value. A Lagrange multiplier problem is to maximize a function when there exists another function that acts as a constraint.

Constrained optimization using Lagrange multipliers: To find the global maxima and minima of a function $f(x_1, x_2, \ldots, x_n)$ subject to a constraint $g(x_1, x_2, \ldots, x_n) = c$ or $g(x_1, x_2, \ldots, x_n) \le c$, do the following:
1. Find the gradient of the function to be optimized, $\nabla f(x_1, x_2, \ldots, x_n)$.
2. Find the gradient of the constraint function, $\nabla g(x_1, x_2, \ldots, x_n)$. For this step, ignore the right-hand side of the constraint, and only work with the function on the left-hand side of the equation or inequality.
3. Set $\nabla f(x_1, x_2, \ldots, x_n) = \lambda \nabla g(x_1, x_2, \ldots, x_n)$. This step involves writing down an equation for each dimension of the gradients. You are setting each partial derivative of f equal to λ times the corresponding partial derivative of g. These equations together form a system of equations, and your goal will be to solve for the values of the independent variables $x_1, x_2, \ldots, x_n$ and, if necessary, λ that make every equation simultaneously true.
4. Add the constraint as one additional equation in the system of equations. If the constraint has a ≤ sign, treat it as an equal sign instead.
5. Once you have found the points $(x_1, x_2, \ldots, x_n)$ that solve the system of equations, calculate the value of the original function at each of these points.
6. Compare the value of f at the critical points that are local maxima and local minima to the value of f in the solutions to the system of equations. The highest value is the global maximum, and the lowest value is the global minimum.
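The steps above can be sketched in Python. The problem used here is hypothetical and chosen only for illustration (it is not taken from the text): optimize $f(x,y) = x^2 + 2y^2$ subject to $x^2 + y^2 \le 1$. The boundary candidates and their multipliers are assumed to have been found by hand in Steps 3 and 4; the code verifies that each one satisfies $\nabla f = \lambda \nabla g$ and then carries out Steps 5 and 6.

```python
import math


# Hypothetical objective and constraint function, for illustration only.
def f(x, y):
    return x ** 2 + 2 * y ** 2


def grad_f(x, y):
    return (2 * x, 4 * y)


def grad_g(x, y):
    # g(x, y) = x^2 + y^2, with the constraint g(x, y) <= 1.
    return (2 * x, 2 * y)


def satisfies_lagrange(point, lam, tol=1e-9):
    """Check grad f = lam * grad g and the boundary equation g = 1."""
    x, y = point
    fx, fy = grad_f(x, y)
    gx, gy = grad_g(x, y)
    on_boundary = math.isclose(x ** 2 + y ** 2, 1.0, abs_tol=tol)
    return on_boundary and abs(fx - lam * gx) < tol and abs(fy - lam * gy) < tol


# Steps 3-4: candidate points and multipliers solved by hand.
boundary_candidates = {
    (0.0, 1.0): 2.0, (0.0, -1.0): 2.0,   # x = 0 forces y = +/-1, lambda = 2
    (1.0, 0.0): 1.0, (-1.0, 0.0): 1.0,   # lambda = 1 forces y = 0, x = +/-1
}
assert all(satisfies_lagrange(p, lam) for p, lam in boundary_candidates.items())

# Steps 5-6: evaluate f at the boundary candidates and at the interior
# critical point (0, 0), where grad f = 0, then compare.
candidates = list(boundary_candidates) + [(0.0, 0.0)]
values = {p: f(*p) for p in candidates}
global_max = max(values.values())
global_min = min(values.values())
maximizers = sorted(p for p, v in values.items() if v == global_max)
minimizers = sorted(p for p, v in values.items() if v == global_min)
print("global max", global_max, "at", maximizers)
print("global min", global_min, "at", minimizers)
```

In this hypothetical problem the maximum lies on the boundary and the minimum at the interior critical point, which is why Step 6 compares both kinds of candidates.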
For example, suppose that we want to find the global maximum and minimum points of f(x,y) = x + y , subject to the constraint
x +y ≤ .
Before we worry about the constraint, let’s find and identify the critical points of the function. The gradient of f(x,y) is x . ∇f(x,y) = y The only critical point is ( , ). To check whether this critical point represents a local maximum, a local minimum, or a saddle point, we have to analyze the Hessian matrix. The Hessian in this case is H(f(x,y)) =
,
which in this case is constant for all points, including (0, 0). Next, we check whether fxx (a,b)fyy (a,b) − (fxy (a,b)) > . In this case, the quantity is sign of
× −
= , which is indeed positive. Next, we check the
fxx ( , ) + fyy ( , ), which in this case is + = , which is positive, so the function has a local minimum at ( , ). In particular, the value of this minimum is f( , ) = . But this function is bounded above by the constraint, which means that the global minimum and maximum could exist at the constraint. To find these global extreme points, we use the method of Lagrange multipliers. The constraint function is g(x,y) = x + y ,
and its gradient is
x . y
∇g(x,y) = Using the Lagrange requirement that
∇f(x,y) = λ∇g(x,y), we set up the following system of equations: x + y = , x = λ x, y = λ y. The second equation is true when x =
or λ = . If x = , then the first equation becomes y = ,
which yields solutions of (x,y) = ( , − ) and (x,y) = ( , ). If λ = , then the third equation is only true when y = . Plugging y = into the first equation gives us x = , which yields solutions of (x,y) = (− , ) and (x,y) = ( , ). So the method of Lagrange multipliers tells us to consider the points ( , − ), ( , ), (− , ), and ( , ) as candidates for global extreme points. The last step is to find the values of the function at these points, compare them against the value of the function at the critical point, and choose the highest of these values as the global maximum and the lowest as the global minimum. The value of the function at these points is f(− , ) = (− ) + ( ) = , f( , ) = ( ) + ( ) = , f( , − ) = ( ) + (− ) = f( , ) = ( ) + ( ) =
, .
So the global minimum of the function exists at ( , ), and the global maximum exists at two points, both ( , − ) and ( , ). Example 1. The global maximum of the function f(x,y) = e−x
− y +
is the point ( , ). Find the maximum point of the function on the line y= x− , and compare it with the global maximum. In this problem, the constraint can be rewritten as x−y= ,
so that g(x,y) = − x + y. First, we find the gradient of f(x,y). The first partial derivative with respect to x is ∂ −x − y + e = − xe−x − y + , ∂x and the first partial derivative with respect to y is ∂ −x e ∂y
− y +
= − ye−x
− y +
.
The partial derivatives of g(x,y) are ∂ ( x − y) = − . ∂y
∂ ( x − y) = , ∂x
Next, we set each partial derivative of f equal to λ times the corresponding partial derivative of g, yielding this system of equations: − xe−x − y + = λ, − ye−x
− y +
= −λ.
We also include the constraint in this system: − x+y= , − xe−x − ye−x
− y + − y +
= λ, = −λ.
Solving the system of equations can be tricky. One technique that often works is to solve the partial derivative equations for λ and then to set these expressions equal to one another. In this instance, we only need to multiply the third equation by − to solve for λ. Solving the second equation for λ gives us λ = − xe−x
− y +
,
and setting the expressions for λ equal to each other gives us − xe−x
− y +
= ye−x
− x = y, −x = y, x+ y= . The system of equations is now
x−y= , x+ y= ,
− y +
,
which we can solve by plugging x = − y into the first equation: (− y) − y = , − y−y= , − y= , y=−
.
Now we plug y into the second equation to solve for x: ( ) x+ − = , x− = ,
x=
.
The point ( / , − / ) is the maximum point5 of the function on the line y = x − . The value of the function at this point is ( ) ( ) ) ( − + , = e− f = . , and the value of the function at the global maximum ( , ) is f( , ) = e−(
) − ( ) +
=e =
. .
7.4 Multiple Integrals
As discussed in Section 6.1, integrals have two purposes. They can be used to find antiderivatives or the area under a curve. Likewise, multiple integrals also have two purposes: (1) they can evaluate a higher dimensional space underneath a curve (definite multiple integrals) and (2) they can reverse the operation of partial differentiation (indefinite multiple integrals).

7.4.1 Notation
Multiple integrals are written with several integral signs. For a three-dimensional function f(x,y), multiple integrals are called double integrals and are written
\[ \iint f(x,y)\,dy\,dx \quad \text{or} \quad \iint f(x,y)\,dx\,dy. \]
For a four-dimensional function f(x,y,z), multiple integrals are called triple integrals, which take the form
\[ \iiint f(x,y,z)\,dz\,dy\,dx, \]
where the difference terms dx, dy, and dz can be written in any order, so that
5 We can use the multivariate second derivative test to demonstrate that this point is indeed a maximum on the line, instead of a minimum or a saddle point.
\[
\begin{aligned}
\iiint f(x,y,z)\,dz\,dy\,dx &= \iiint f(x,y,z)\,dz\,dx\,dy = \iiint f(x,y,z)\,dy\,dx\,dz \\
&= \iiint f(x,y,z)\,dy\,dz\,dx = \iiint f(x,y,z)\,dx\,dz\,dy = \iiint f(x,y,z)\,dx\,dy\,dz.
\end{aligned}
\]
Although the order of the difference terms can be changed, the best practice is to write them in the order in which you will solve the integral for each variable. As will be discussed in Sections 7.4.2 and 7.4.3 below, the best approach for solving multiple integrals is to think of them as a series of integrals with only one independent variable each. The first difference term refers to the variable that you choose to integrate first, the second difference term refers to the variable that you choose to integrate second, and so on. So writing "dy dx dz" means that you intend to integrate the function first with respect to y, then with respect to x, and finally with respect to z. A multiple integral can have up to as many integral signs as independent variables. In general, an n-tuple integral has n integral signs.

The purpose of indefinite multiple integrals is to undo the mathematical operation of partial differentiation. Since no areas, volumes, or higher-dimensional spaces are calculated, they have no region of integration to worry about. For definite multiple integrals, however, the goal is to evaluate the amount of space underneath a multidimensional function within a region of the graph. If a function has two independent variables, then the function is three-dimensional, and this space is called volume. A definite multiple integral uses a subscript A to express the region of integration, or the bounds of the graph under which we will evaluate the volume. A is a piece of notation that stands in for a specific region on which a multiple definite integral will be evaluated. For example, if we wanted to find
\[ \iint_A x + y \, dy\, dx, \]
then we are trying to find the volume underneath the graph f(x,y) = x + y inside the region implied by A. Suppose that
\[ A = \left\{ (x,y) \in \mathbb{R}^2 \mid 0 \le x \le 3,\ 2 \le y \le 5 \right\}, \]
where $\mathbb{R}^2$ is the set of ordered pairs where each number within the pair is real. Then the region of integration is a square on the (x,y) plane, where x goes from 0 to 3 and y goes from 2 to 5. The integral that gives us this volume underneath f(x,y) bounded by this square is
\[ \int_0^3 \int_2^5 x + y \, dy\, dx, \]
where the first integral sign contains the bounds on x and the second integral sign contains the bounds on y. The order of these integral signs matters because they match up with the difference terms dx and dy at the end of the expression. The first integral sign always matches up with the last difference expression, the second integral sign matches up with the second-to-last difference expression, and so on. In fact, multiple integrals can be thought of as integrals
inside other integrals. We can rewrite the above double integral as
\[ \int_0^3 \left( \int_2^5 x + y \, dy \right) dx. \]
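The idea of an integral inside another integral can be sketched numerically. The Python code below is an illustration under stated assumptions: it takes the integrand as printed, x + y, over the square where x goes from 0 to 3 and y goes from 2 to 5, and it uses a simple midpoint rule (a numerical technique not discussed in the text) for each one-variable integral. The function names are hypothetical.

```python
def midpoint_integral(func, lower, upper, steps=200):
    """Approximate a one-variable definite integral with the midpoint rule."""
    width = (upper - lower) / steps
    return width * sum(func(lower + (k + 0.5) * width) for k in range(steps))


def integrand(x, y):
    # Integrand taken as printed in the text.
    return x + y


def inner(x):
    # Inner integral: integrate over y from 2 to 5 while holding x fixed.
    return midpoint_integral(lambda y: integrand(x, y), 2.0, 5.0)


# Outer integral: integrate the inner result over x from 0 to 3.
volume = midpoint_integral(inner, 0.0, 3.0)
print(round(volume, 6))
```

Working by hand, the inner integral is $3x + 10.5$ and the outer integral is $13.5 + 31.5 = 45$, which the numerical result should match up to floating-point error because the midpoint rule is exact for linear integrands.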
Recall from Section 6.5.3 that these bounds can also be infinite. Example 1. Express the multiple integral of h(a,b) over the region A = {(a,b) ∈ R | a > , b
, < y
∫
) and y
, then the matrix is positive-definite, and if $f_{xx}(a,b) + f_{yy}(a,b) < 0$, then the matrix is negative-definite. If the matrix is symmetric and the first condition is true, then it isn't possible that $f_{xx}(a,b) + f_{yy}(a,b) = 0$.
These rules, which apply only to a (2 × 2) Hessian, are a work-around for the fact that this book presents multivariate optimization before presenting linear algebra. Now that you are acquainted with the concept and derivation of eigenvalues, there is a much simpler and more general definition of positive- and negative-definite:

Definition: Positive-definite and negative-definite for any square matrix
A square matrix is positive-definite if all of its eigenvalues are positive. A square matrix is negative-definite if all of its eigenvalues are negative.
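For a (2 × 2) matrix, the eigenvalue definition can be sketched in a few lines of Python by solving the characteristic equation with the quadratic formula. The function names and the three test matrices below are hypothetical examples chosen for illustration; they are not matrices from the text.

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]], found from the characteristic equation
    lambda^2 - (a + d) * lambda + (a * d - b * c) = 0."""
    trace = a + d
    det = a * d - b * c
    disc = trace ** 2 - 4 * det
    if disc < 0:
        # Cannot happen for a symmetric matrix such as a Hessian.
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)


def classify(a, b, c, d):
    """Label a 2x2 matrix by the signs of its eigenvalues."""
    low, high = eigenvalues_2x2(a, b, c, d)
    if low > 0:
        return "positive-definite"
    if high < 0:
        return "negative-definite"
    return "neither"  # mixed signs or a zero eigenvalue


print(classify(2, 0, 0, 3))     # eigenvalues 2 and 3
print(classify(-2, 1, 1, -2))   # eigenvalues -3 and -1
print(classify(1, 2, 2, 1))     # eigenvalues -1 and 3
```

For larger matrices the characteristic equation is no longer quadratic, so in practice a numerical eigenvalue routine would replace `eigenvalues_2x2`, while the classification step stays the same.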
Note that this definition is both much simpler and applicable to every Hessian, even those that refer to functions in many more than two dimensions. In Section 7.3.3, we found the critical points of the bivariate normal distribution N( , , , , . ), √ f(x,y) =
− / (x +y −xy− x− y+ )
π
e
,
CHAPTER 10. LINEAR SYSTEMS OF EQUATIONS AND EIGENVALUES
by taking the gradient and setting it equal to 0. We found that the critical point occurs at the mean, ( , ). The Hessian at this point is ( ) − . − . H f( , ) = . − . − . To determine whether the critical point (3,4) represents a local maximum, a local minimum, or a saddle point, we conducted the second partial derivative test for this critical point using the complicated definition presented above. We can perform this test more elegantly by finding the eigenvalues of this matrix. First, we plug the matrix into the equation (A − λI)x = , ( ) − . − . x − λ = , − . − . x (− . − λ) − . x . a = − . (− . − λ) x We require the determinant of this matrix to be 0, implying that (− .
− λ) − .
= ,
− .
= ,
λ + . λ+ .
= .
λ + . λ+ .
We can solve for λ using the quadratic formula: √ √ − . ± . − ( )( . ) . − . ± λ= = () √ − . ± . − . ± . = = , λ=
− . − .
=− .
and
λ=
− . + .
− .
=− . .
Since both eigenvalues are negative, the Hessian matrix at the critical point is negativedefinite. Therefore, the second partial derivative test indicates that this critical point represents a local maximum. Example 1. Determine whether the following matrices are positive-definite, negative-definite, or neither: − − . , B = , C = A= For each of these matrices, we will write the characteristic equation |A − λI| =
and find every solution for λ. These solutions are the eigenvalues of the matrix. If all of the eigenvalues are positive, then the matrix is positive-definite. If all of the eigenvalues are negative, then the matrix is negative-definite. If some of the eigenvalues are positive and others are negative, then the matrix is neither positive-definite nor negative-definite.4 The characteristic equation of matrix A is − λ
= ,
= ,
−λ −λ
( − λ)( − λ) − λ −
= ,
λ+
= ,
(λ − )(λ −
)= ,
λ= , . Since both eigenvalues are positive, A is positive-definite. The characteristic equation of matrix B is − λ = , −λ ( − λ)( − λ) −
= ,
λ −
= ,
(λ −
λ−
)(λ + ) = ,
λ=
,− .
Since B has one positive eigenvalue and one negative eigenvalue, B is neither positive-definite nor negative-definite. Finally, the characteristic equation of C is − −λ − = , −λ (− − λ)(−λ) + = , λ + λ+ = , (λ + )(λ + ) = , λ = − ,− . Since both eigenvalues are negative, C is negative-definite.
4 If all of the eigenvalues are greater than or equal to 0, that is, if they are all either positive or 0, then the matrix is called positive-semidefinite. If all of the eigenvalues are less than or equal to 0, then the matrix is negative-semidefinite.
Example 2. In Example 1 in Section 7.3.3, we calculated that the critical points of f(x,y) = x − y − x − xy + are ( , ), (− , − ), and ( , ). We also found that the Hessian is ( ) xy − x − . H f(x,y) = x − When we plug each critical point into the Hessian, it becomes ( ) ( ) − − , H f(− , − ) = , H f( , ) = −
( ) H f( , ) =
.
Previously, to test whether each critical point was a local maximum, a local minimum, or a saddle point, we applied the rules listed in Section 7.3.3. Now we can find the eigenvalues of each critical point’s Hessian to determine if it is positive-definite, negative-definite, or neither. If a critical point’s Hessian is positive-definite, then the critical point represents a local minimum. If a critical point’s Hessian is negative-definite, then the critical point represents a local maximum. Otherwise, the critical point is a saddle point. First, we calculate the eigenvalues of the Hessian for the critical point ( , ). The characteristic equation for this matrix is − −λ − = , − −λ (−
− λ)(−λ) −
= ,
λ−
= .
λ +
We need to use the quadratic formula to find the solutions to this equation: √ √ √ − ± − ( )(− ) − ± + − ± λ= = = √ √ − ± = =− ± , λ=− . , . . Since this matrix has one positive and one negative eigenvalue, the matrix is neither positivedefinite nor negative-definite. Therefore, the critical point ( , ) is a saddle point. The critical points (− , − ) and ( , ) share the same Hessian. The characteristic equation for this Hessian is −λ = , −λ
(
− λ)(−λ) −
= ,
λ−
= .
λ + Again, we have to apply the quadratic formula: √ ± − ( )(− ) λ= = √ √ ± = = ± , λ=− .
and
.
±
√
+
=
±
√
.
Since this matrix has one positive and one negative eigenvalue, the matrix is neither positivedefinite nor negative-definite. Therefore, the critical points (− , − ) and ( , ) are saddle points as well.
10.4.3 Finding Eigenvectors
Once we know the eigenvalues λ, we can plug them into the equation $(A - \lambda I)x = 0$. We already know that the determinant of $(A - \lambda I)$ will be 0 because we calculated the values of λ that made the determinant equal to 0. Therefore, this is a homogeneous system of equations with infinitely many nontrivial solutions. We can find the eigenvectors by following the steps we followed in Section 10.3:
• Use elementary row operations on the augmented matrix implied by the equation $(A - \lambda I)x = 0$.
• Reduce until one row becomes all 0s.
• Solve the system of equations in terms of the free variable.

Consider again the matrix
\[ A = \begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix}. \]
In Section 10.4.1, we determined that the eigenvalues of this matrix are λ = 3 and λ = −2. We will determine the set of eigenvectors of A that are associated with each eigenvalue. First let's consider λ = 3. Plugging this eigenvalue into $(A - \lambda I)x = 0$ gives us
\[ \left( \begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix} - 3 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \]
\[ \begin{bmatrix} 1 & 3 \\ -2 & -6 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
This equation implies the system of equations
\[ x_1 + 3x_2 = 0, \qquad -2x_1 - 6x_2 = 0, \]
with the augmented matrix
\[ \left[ \begin{array}{cc|c} 1 & 3 & 0 \\ -2 & -6 & 0 \end{array} \right]. \]
We now perform elementary row operations on the augmented matrix. We add two times the first row to the second row to obtain
\[ \sim \left[ \begin{array}{cc|c} 1 & 3 & 0 \\ 0 & 0 & 0 \end{array} \right]. \]
This matrix implies that the system of equations is now
\[ x_1 + 3x_2 = 0, \qquad x_2 = x_2, \]
where we can only make a trivial statement about the free variable $x_2$ since this row reduced to the trivially true statement 0 = 0. Solving the first equation for $x_1$, we get
\[ x_1 = -3x_2, \qquad x_2 = x_2. \]
The solutions can be written in terms of the free variable $x_2$ as follows:
\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -3x_2 \\ x_2 \end{bmatrix} = x_2 \begin{bmatrix} -3 \\ 1 \end{bmatrix}. \]
This expression represents the set of eigenvectors of A associated with the eigenvalue λ = 3. We can plug in any value we want for $x_2$ (other than 0, as this produces a vector with no direction and no magnitude), and we will obtain another valid eigenvector. Note that when $x_2 = 1$, we have the vector $\begin{bmatrix} -3 \\ 1 \end{bmatrix}$, which we demonstrated to be an eigenvector of this matrix in Sections 8.5 and 10.4.1. Other eigenvectors associated with λ = 3 are obtained by choosing other nonzero values of $x_2$. Each and every single one of these eigenvectors is mapped to three times itself when left-multiplied by $\begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix}$.

One important eigenvector is the unit eigenvector, the eigenvector with a length of exactly 1. As described in Section 8.5, the length of a vector is given by the following definition:
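Definition: The length of a vector A is denoted ||A||.
• The length of a two-dimensional vector $A = \begin{bmatrix} a_1 \\ a_2 \end{bmatrix}$ is $\|A\| = \sqrt{a_1^2 + a_2^2}$.
• The length of a three-dimensional vector $A = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}$ is $\|A\| = \sqrt{a_1^2 + a_2^2 + a_3^2}$.
• In general, the length of an N-dimensional vector is $\|A\| = \sqrt{\sum_{n=1}^{N} a_n^2}$.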
We can find the unit eigenvector by taking any eigenvector and dividing by the length of that eigenvector. The new vector will be an eigenvector of length 1.

Example 1. Find the unit eigenvector of $\begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix}$ associated with λ = 3.

We've already determined that the eigenvectors associated with λ = 3 are of the form
\[ x_2 \begin{bmatrix} -3 \\ 1 \end{bmatrix}. \]
We need to choose one eigenvector and divide by its length. Take, for instance, the eigenvector $\begin{bmatrix} -3 \\ 1 \end{bmatrix}$. The length of this vector is
\[ \left\| \begin{bmatrix} -3 \\ 1 \end{bmatrix} \right\| = \sqrt{(-3)^2 + 1^2} = \sqrt{10}. \]
Dividing by the length, the unit eigenvector associated with λ = 3 is
\[ \begin{bmatrix} -3/\sqrt{10} \\ 1/\sqrt{10} \end{bmatrix}. \]
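Example 2. Find the unit eigenvector of $\begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix}$ associated with λ = −2.

To find the eigenvectors associated with λ = −2, we first plug λ = −2 into $(A - \lambda I)x = 0$:
\[ \left( \begin{bmatrix} 4 & 3 \\ -2 & -3 \end{bmatrix} + 2 \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \]
\[ \begin{bmatrix} 6 & 3 \\ -2 & -1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}. \]
This equation implies the system of equations
\[ 6x_1 + 3x_2 = 0, \qquad -2x_1 - x_2 = 0, \]
with the augmented matrix
\[ \left[ \begin{array}{cc|c} 6 & 3 & 0 \\ -2 & -1 & 0 \end{array} \right]. \]
We now perform elementary row operations on the augmented matrix. We interchange the rows and add three times the first row to the second:
\[ \sim \left[ \begin{array}{cc|c} -2 & -1 & 0 \\ 6 & 3 & 0 \end{array} \right] \sim \left[ \begin{array}{cc|c} -2 & -1 & 0 \\ 0 & 0 & 0 \end{array} \right]. \]
This matrix implies that the system of equations is now
\[ -2x_1 - x_2 = 0, \quad x_2 = x_2 \quad \Longrightarrow \quad x_1 = -\tfrac{1}{2}x_2, \quad x_2 = x_2. \]
The solutions can be written in terms of the free variable $x_2$ as follows:
\[ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} -\tfrac{1}{2}x_2 \\ x_2 \end{bmatrix} = x_2 \begin{bmatrix} -\tfrac{1}{2} \\ 1 \end{bmatrix}, \]
which expresses the set of eigenvectors associated with λ = −2.

To find the unit eigenvector, consider the eigenvector $\begin{bmatrix} -1 \\ 2 \end{bmatrix}$. The length of this vector is
\[ \left\| \begin{bmatrix} -1 \\ 2 \end{bmatrix} \right\| = \sqrt{(-1)^2 + 2^2} = \sqrt{5}. \]
Dividing by the length, the unit eigenvector associated with λ = −2 is
\[ \begin{bmatrix} -1/\sqrt{5} \\ 2/\sqrt{5} \end{bmatrix}. \]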
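The two examples above can be checked with a short Python sketch. It assumes the matrix is [[4, 3], [−2, −3]], a matrix with eigenvalues 3 and −2 that is consistent with the row operations described in this section; the helper names `matvec`, `length`, and `unit` are illustrative. The code normalizes one eigenvector per eigenvalue and confirms that multiplying by the matrix only rescales it.

```python
import math

# Assumed example matrix with eigenvalues 3 and -2.
A = [[4.0, 3.0],
     [-2.0, -3.0]]


def matvec(matrix, vector):
    """Left-multiply a vector by a matrix."""
    return [sum(row[j] * vector[j] for j in range(len(vector))) for row in matrix]


def length(vector):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(component ** 2 for component in vector))


def unit(vector):
    """Divide a vector by its length to obtain a vector of length 1."""
    norm = length(vector)
    return [component / norm for component in vector]


unit_for_3 = unit([-3.0, 1.0])        # any eigenvector for lambda = 3
unit_for_minus_2 = unit([-1.0, 2.0])  # any eigenvector for lambda = -2

for lam, vec in [(3.0, unit_for_3), (-2.0, unit_for_minus_2)]:
    image = matvec(A, vec)
    # A v should equal lambda * v component by component.
    is_eigenvector = all(abs(a - lam * v) < 1e-12 for a, v in zip(image, vec))
    print(lam, vec, "length:", round(length(vec), 12), "eigenvector:", is_eigenvector)
```

Starting from a different nonzero multiple of the same eigenvector would give the same unit eigenvector, possibly with both signs flipped.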
10.5 Statistical Measurement Models
In Jacobellis v. Ohio (1964), the U.S. Supreme Court declined to overturn a state law that banned obscene material. The majority instead ruled narrowly that the particular film in question did not count as obscene and was therefore protected under the First Amendment. Justice Potter Stewart, concurring, wrote, “I shall not today attempt further to define the kinds of material I understand to be embraced within that shorthand description [pornography]; and perhaps I could never succeed in intelligibly doing so. But I know it when I see it, and the motion picture involved in this case is not that” (italics added). As a result of the narrow ruling, the Supreme Court was obliged to rule on a case-by-case basis on whether a particular speech counted as obscene. Reportedly, the justices would hold infamously awkward sessions to meet and watch contested films:

These films ranged from scientific documentaries to the improbable escapades of lesbian nymphomaniacs. Justice Thurgood Marshall, a civil rights hero, took merciless pleasure in narrating the clips for the special benefit of Justice John Marshall Harlan Jr., an elegant former Wall Street lawyer who was by then losing his eyesight. Mocking Justice Potter Stewart’s insistence that “I know it when I see it,” clerks would call out in the dark, “I see it, I see it!”5

All this is to say that “I know it when I see it” leaves a lot to be desired as a means of measuring a social concept. In the social sciences, we very often deal with variables that exist and have important implications but cannot be directly observed. Quantifying such a variable involves three steps:
1. Characterization. Describe the concept you are attempting to measure. Have an idea about which cases and features fit within the concept and which do not. To proceed, you have to believe that it actually makes sense to quantify the concept.
2. Representation. What observable variables are strong enough implications of the concept to warrant being included in the measurement model?
3. Measurement. How will you combine the observed items?

To illustrate the idea of statistical measurement, let’s consider the idea of regional authority.6 In some countries, regional governments have little or no power over social policy and little influence on the federal government. In other countries, regional governments have a great deal of control over social policy.
Our goal is to use observable data to create an index of regional authority that can be used to compare different countries. But, first, we need to specify how the concept of regional authority should be characterized, represented, and measured.
1. Characterization. The latent variable is the strength of regional governments in a country relative to the federal government. Greece, for example, has 13 regions, but these regions traditionally have very little autonomy or influence and were only granted elected governors as of 2011. In contrast, Belgium has two regions with near-perfect autonomy from each other, and the federal government mostly exists to settle disputes between the two regions. There are two dimensions to regional authority. One dimension is self-rule, which represents the extent to which regions are autonomous. The other dimension is shared rule, which represents the extent to which regional governments influence federal policy.
2. Representation. Regional authority is measured with eight indices:
   a. Institutional depth (ID): Are regional governments concentrated in a capital? Are they subject to federal veto?
   b. Policy scope (PS): Do regions make their own educational, immigration, and welfare policies?
   c. Fiscal autonomy (FA): Do regions set their own major tax rates?
   d. Representation (R): Does the region have an independent legislature and executive?
   e. Lawmaking (LM): Do regions send representatives to the national legislature?
   f. Executive control (EC): Does the federal executive negotiate decisions with regional governments?
   g. Fiscal control (FC): Does the federal legislature negotiate tax law with regional governments?
   h. Constitutional reform (CR): Can regions propose/veto federal constitutional amendments?
   The first four variables focus on self-rule, and the last four focus on shared rule.
3. Measurement. How should these indices be weighted? Do they imply one latent dimension or more than one?

A statistical measurement model, such as principal components analysis (PCA) or correspondence analysis (CA), enters into the process in Step 3.7 PCA is designed to conduct measurement from continuous observed indices, and CA is designed to conduct measurement from unordered categorical indices. The eight indices in the regional authority data are coded as continuous at the country level and categorical at the regional level. In Section 10.5.1, we will first apply PCA to the continuous, country-level indices; then, in Section 10.5.2, we will apply CA to the categorical, regional-level indices.

PCA and CA are two simple but important examples of scaling, a particular kind of measurement, in the social sciences. For example, DW-NOMINATE is the most widely used software for estimating the ideological ideal points of members of Congress,8 and the DW-NOMINATE algorithm uses principal components. CA is similar to a technique called latent semantic analysis, which is used to analyze texts and determine the similarity of particular words and phrases.9

Please note that the following discussions are not intended to replace a more rigorous presentation of these methods in a course on statistical measurement. The goal here is to show how a few common social science analytic tools depend on eigenvalues and eigenvectors.

5 Tribe, L. (2014, June 3). Free speech and the Roberts court: Uncertain predictions. The Washington Post. Retrieved from www.washingtonpost.com/news/volokh-conspiracy/wp/2014/06/03/free-speech-and-the-robertscourt-uncertain-protections.
6 The data used here are from Hooghe, L., Marks, G., & Schakel, A. H. (2010). The rise of regional authority: A comparative study of 42 countries. London, UK: Routledge, and are made publicly available by the authors on webpage http://www.unc.edu/~hooghe/data_ra.php.
7 PCA and CA are foundational topics in a course on statistical measurement, but there are many other measurement techniques that are useful for social science research, such as exploratory and confirmatory factor analysis, structural equation models, cluster and latent class analysis, multidimensional scaling, item response theory, finite mixture models, and ideal point models. For more information on PCA and CA generally, see chapters 4 and 5 of David J. Bartholomew, Fiona Steele, Irini Moustaki, and Jane Galbraith, (2008). Analysis of Multivariate Social Science Data, Second Edition. Boca Raton, FL: Chapman & Hall/CRC.
8 For more information about DW-NOMINATE, see http://voteview.com/.
9 For more information on latent semantic analysis, see Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch, eds, (2014). Handbook of Latent Semantic Analysis. University of Colorado Institute of Cognitive Science Series. New York, NY: Routledge.
10.5.1 Principal Components Analysis
Once we have a clear idea about the concept we are trying to measure and about the observable variables that we will use to measure it, we have to decide how to combine these variables into one measurement. Some researchers use proxies (the substitution of one observed variable for the latent concept) or use an unweighted sum of several proxies. The problem with these approaches is that they do not take into account the relative validity of each proxy as a measure of the latent concept. PCA is a means to weight the proxies in an informed way and allows us to
• derive weights (loadings) for the items that, like regression, maximize the explained variance of the latent variable,
• examine the loadings to consider which items matter for which latent variables,
• and make an informed decision about how many latent variables (dimensions) exist.

PCA takes advantage of the ability of eigenvalues and unit eigenvectors to succinctly summarize the behavior of a larger matrix. The components of the eigenvalue equation
\[ Ax = \lambda x \]
have direct interpretations in data analysis. Specifically, A represents a covariance (or a correlation) matrix of the variables that will be used to measure the latent variable. The eigenvalue λ represents the variance of the latent variable, and the unit eigenvector x contains the weights (loadings) to place on each variable.

To run PCA on measurement data, we first construct a covariance matrix for the observed metrics. The technique for building the covariance matrix is discussed in Sections 7.4.4 and 9.3, although computers can perform these computations much more quickly.

Example 1. The covariance matrix for the regional authority data is as follows:10

        ID     PS     FA     R      LM     EC     FC     CR
ID     2.29
PS     2.17   2.65
FA     1.83   2.24   2.57
R      3.25   3.28   2.73   5.31
LM     0.66   0.73   0.73   0.90   0.49
EC     0.54   0.67   0.52   0.77   0.26   0.35
FC     0.59   0.66   0.56   0.76   0.38   0.20   0.53
CR     1.23   1.36   1.32   1.65   0.69   0.38   0.80   1.83

The upper-diagonal elements are left blank because the matrix is symmetric and they are equal to the lower-diagonal elements.
If some of the variables have much greater variances than the others, it may be better to use the correlation matrix for the variables instead of the covariance matrix.

Next, we find the eigenvalues and the unit eigenvectors of the covariance matrix. Again, a computer is helpful for this task because computer algorithms for calculating eigenvalues and unit eigenvectors are fast and accurate, and they save us the trouble of solving an eighth-degree polynomial equation.

Example 2. The eigenvalues for the regional authority covariance matrix are

\[
\lambda_1 = 12.70, \quad \lambda_2 = 1.52, \quad \lambda_3 = 0.84, \quad \lambda_4 = 0.34, \quad
\lambda_5 = 0.22, \quad \lambda_6 = 0.18, \quad \lambda_7 = 0.14, \quad \lambda_8 = 0.07.
\]
The unit eigenvectors associated with each eigenvalue are as follows. [Only the column for the first (largest) eigenvalue is legible in the source; the entries of the remaining seven columns are not recoverable.]

        λ1
ID     0.40
PS     0.44
FA     0.39
R      0.62
LM     0.14
EC     0.11
FC     0.12
CR     0.26

The eigenvalues represent how much of the total variance of the data can be ascribed to each dimension.11 To calculate the total percentage of the variance ascribed to each dimension, we divide that dimension's eigenvalue by the sum of the eigenvalues. The percentage of the variance explained by each dimension can inform us about the potential number of relevant dimensions of the latent variable. If the first dimension explains a dominant share of the variance, that is strong evidence that the latent variable is unidimensional.

10 These data vary both across countries and over time. To perform the analysis correctly, we must find a way to explicitly model and account for the time variation. In this discussion, however, it is easier to ignore the time variation in order to keep the focus on the mechanisms of PCA.
11 It's also the case that PCA ensures that these eigenvectors are independent; that is, if we were to graph them as we graphed vectors in Section 8.5, each of these eigenvectors would form a right angle with every other eigenvector. Another word for independent is orthogonal, and two unit vectors that are orthogonal are called orthonormal.

Example 3. The first three eigenvalues of the regional authority data explain 93% of the variance of the data.

Dimension 1: 79%
\[
\frac{12.70}{12.70 + 1.52 + 0.84 + 0.34 + 0.22 + 0.18 + 0.14 + 0.07} = \frac{12.70}{16.01} \approx 0.79
\]
Dimension 2: 9%
\[
\frac{1.52}{12.70 + 1.52 + 0.84 + 0.34 + 0.22 + 0.18 + 0.14 + 0.07} = \frac{1.52}{16.01} \approx 0.09
\]
Dimension 3: 5%
\[
\frac{0.84}{12.70 + 1.52 + 0.84 + 0.34 + 0.22 + 0.18 + 0.14 + 0.07} = \frac{0.84}{16.01} \approx 0.05
\]

The number of relevant dimensions is a judgment call. We can defend the choice of extracting one latent variable because the first dimension explains 79% of the variance, but we can feasibly work with one, two, or three dimensions. Other dimensions each explain no more than 2% of the variance and are thus excluded.
To construct the underlying latent indices, we multiply each variable by the corresponding element of the unit eigenvector for that dimension, and then add them together. The elements of the unit eigenvector are the weights for each item.

Example 4. The first dimension of the regional authority index (RAI), as calculated with PCA, is
\[
\text{RAI} = 0.40\,\text{ID} + 0.44\,\text{PS} + 0.39\,\text{FA} + 0.62\,\text{R} + 0.14\,\text{LM} + 0.11\,\text{EC} + 0.12\,\text{FC} + 0.26\,\text{CR}.
\]
This index seems to favor the self-rule indices over the shared-rule indices. The factor that receives the highest weight in this index is whether the region has its own legislature and executive, and the factor that receives the smallest weight is whether the federal executive negotiates federal policy with regional governments.
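
The PCA steps in this section can be sketched in a few lines of code. The following is a minimal illustration, not the procedure used to produce the book's results: it assumes Python with NumPy, it starts from the rounded covariance entries printed in Example 1 (so its output will differ slightly from the reported values), and the variable names are my own.

```python
import numpy as np

labels = ["ID", "PS", "FA", "R", "LM", "EC", "FC", "CR"]

# Lower triangle of the covariance matrix from Example 1 (rounded values).
lower = [
    [2.29],
    [2.17, 2.65],
    [1.83, 2.24, 2.57],
    [3.25, 3.28, 2.73, 5.31],
    [0.66, 0.73, 0.73, 0.90, 0.49],
    [0.54, 0.67, 0.52, 0.77, 0.26, 0.35],
    [0.59, 0.66, 0.56, 0.76, 0.38, 0.20, 0.53],
    [1.23, 1.36, 1.32, 1.65, 0.69, 0.38, 0.80, 1.83],
]

# Build the full symmetric matrix A from its lower triangle.
n = len(labels)
A = np.zeros((n, n))
for i, row in enumerate(lower):
    for j, value in enumerate(row):
        A[i, j] = value
        A[j, i] = value

# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so reorder from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(A)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance ascribed to each dimension.
share = eigenvalues / eigenvalues.sum()

# Loadings for the first dimension. The sign of an eigenvector is
# arbitrary, so flip it if needed to make the weights positive.
loadings = eigenvectors[:, 0]
if loadings.sum() < 0:
    loadings = -loadings

print("Variance shares:", np.round(share, 3))
for name, weight in zip(labels, loadings):
    print(f"{name}: {weight:.2f}")
```

Given a data matrix with one column per index, the first-dimension score for each observation would then be the matrix product of the data with `loadings`.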
10.5.2 Correspondence Analysis

The idea that an underlying latent variable influences a set of categorical variables is related to a statistical test of association. Two categorical variables are associated if certain categories of one variable are more likely when the other variable is a particular category. Consider, for example, the fiscal control and constitutional reform indices in the regional-level RAI data. The extent to which a regional government exerts control over federal fiscal policy is coded as one of the following categories:

• Category 0. The regional government has no influence over federal fiscal policy (a low level of regional authority).
• Category 1. Federal policymakers consult with regional governments over fiscal policy, but regional governments do not have veto power over fiscal policy (a medium level of regional authority).
• Category 2. Regional governments have the power to veto federal fiscal policy (a high level of regional authority).

The extent to which a regional government influences the process of federal constitutional reform is coded as follows:

• Category 0. The central government has the power to change the constitution without any deference to regionally based representation (a low level of regional authority).
• Category 1. Constitutional changes must be approved by a federal legislature with regionally based representation (a medium–low level of regional authority).
• Category 2. Regional governments participate in the debate over constitutional changes and can propose rules to change the process through which the changes are adopted (a medium–high level of regional authority).
• Category 3. A majority of regional governments can veto proposed changes to the constitution (a high level of regional authority).

A cross-tabulation of the data is a table that counts the number of co-occurrences of each pair of categories. The cross-tabulation for fiscal control and constitutional reform is listed in Table 10.1. The data contain 1,752 observations of regions that have no power to control federal fiscal policy and no power to influence constitutional changes, 290 observations of regions with some influence over fiscal policy but no influence over constitutional changes, and so on.12

Table 10.1 A Cross-Tabulation of the Fiscal Control and Constitutional Reform Indices of Regional Authority

                                   Fiscal Control
Constitutional Reform    Low       Medium    High
Low                      1,752     290       28
Medium-low               54        326       56
Medium-high              33        34        160
High                     121       153       217

Does it look like these two variables are associated? To see this relationship more clearly, we can add row percentages (the entry for each cell divided by the total for each row) to the cross-tabulation. The new cross-tabulation is listed in Table 10.2.

Table 10.2 A Cross-Tabulation of Fiscal Control and Constitutional Reform With Row Percentages

                                   Fiscal Control
Constitutional Reform    Low       Medium    High      Total
Low                      1,752     290       28        2,070
                         84.6%     14.0%     1.4%      100%
Medium-low               54        326       56        436
                         12.39%    74.77%    12.84%    100%
Medium-high              33        34        160       227
                         14.5%     15.0%     70.5%     100%
High                     121       153       217       491
                         24.6%     31.2%     44.2%     100%
Total                    1,960     803       461       3,224
                         60.8%     24.9%     14.3%     100%

Of the regions that have low levels of regional authority on constitutional reform, 84.6% also have low regional authority on fiscal control. But only 14.5% of the regions with medium-high levels of regional authority on constitutional reform have low regional authority on fiscal control. It's pretty clear that greater degrees of fiscal control are more likely for regions with greater regional authority on constitutional reform.13 If there is an underlying variable, like regional authority, that is exerting an influence on constitutional reform and fiscal control simultaneously, then we would expect to see an association like this one. The question then becomes, how do we measure that underlying variable, and what do the categories of each observed variable tell us about the latent variable?

The goal of correspondence analysis is to create a map of the categories, in which the latent variables constitute the axes. Categories that are close together on the map are similar results of a particular level of the latent variable. These maps can include one, two, or sometimes three dimensions. Figure 10.3 is a simple illustration of a correspondence map. This particular map has only one dimension, which I choose to call "regional authority." Low levels of regional authority tend to produce low levels of regional influence with regard to constitutional reform and fiscal control. This analysis does not find that there's a great deal of distance between non-low categories.

12 These data record every region in 42 countries over a time frame of 56 years. As researchers, we would be required to take steps to model changes over time directly. But here, as with the discussion for PCA in Section 10.5.1, I ignore the time variation to present the mechanics of correspondence analysis clearly.
13 A formal statistical test to see whether the variables are truly associated is the χ² test of association. Most textbooks in introductory statistics include a thorough discussion of the χ² test. See, for example, Wackerly, D. D., Mendenhall, W., III, & Scheaffer, R. L. (2008). Mathematical statistics with applications (7th ed.). Belmont, CA: Thomson Higher Education, pp. 714–716.
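
The row percentages, and the χ² statistic mentioned in the footnote, can be reproduced with a short sketch. This assumes Python with NumPy and uses the counts from the fiscal control and constitutional reform cross-tabulation; the variable names are illustrative, and the χ² computation is the usual sum of (observed − expected)² / expected rather than anything specific to this book.

```python
import numpy as np

# Rows: constitutional reform (low, medium-low, medium-high, high).
# Columns: fiscal control (low, medium, high).
counts = np.array([
    [1752, 290,  28],
    [  54, 326,  56],
    [  33,  34, 160],
    [ 121, 153, 217],
], dtype=float)

# Row percentages: each cell divided by its row total.
row_totals = counts.sum(axis=1, keepdims=True)
row_pct = 100 * counts / row_totals

# Expected counts if the two variables were not associated.
col_totals = counts.sum(axis=0, keepdims=True)
n = counts.sum()
expected = row_totals @ col_totals / n

# Chi-square statistic and its degrees of freedom.
chi2 = ((counts - expected) ** 2 / expected).sum()
dof = (counts.shape[0] - 1) * (counts.shape[1] - 1)

print(np.round(row_pct, 1))
print("chi-square:", round(chi2, 1), "with", dof, "degrees of freedom")
```

A statistic this large relative to its degrees of freedom is consistent with the visual impression that the two indices are associated.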
[Figure 10.3 Correspondence Analysis Coordinates for the Categories of the Constitutional Reform and Fiscal Control Indices. A one-dimensional map whose axis is labeled "Level of Regional Authority," showing the categories CR: Low, CR: Med-Low, CR: Med-High, CR: High, FC: Low, FC: Med, and FC: High.]
Note: CR = constitutional reform, FC = fiscal control.
The mathematical process for deriving these coordinates involves eigenvalues and eigenvectors, which, as we saw with PCA, are extremely useful for breaking a joint covariance into independent components. Researchers almost always conduct a correspondence analysis with the help of preprogrammed computer algorithms. But like the OLS estimator for linear regression coefficients, it is useful to work through the steps by hand a few times. Performing the calculation by hand demystifies the process and helps us to understand that computers are useful for monotonous calculations but are no more intelligent than a handheld calculator. The steps to conduct a CA from a cross-tabulation of two categorical variables are as follows:

1. Write the cross-tabulation table within a matrix. For our example, let's call this matrix M. This matrix is
\[
M = \begin{pmatrix} 1752 & 290 & 28 \\ 54 & 326 & 56 \\ 33 & 34 & 160 \\ 121 & 153 & 217 \end{pmatrix}.
\]

2. Take the sum of every element in this matrix. In our case, this sum is 3,224. Divide every element of the cross-tabulation by this value, and call this matrix P (entries rounded to three decimal places):
\[
P = \begin{pmatrix} 0.543 & 0.090 & 0.009 \\ 0.017 & 0.101 & 0.017 \\ 0.010 & 0.011 & 0.050 \\ 0.038 & 0.047 & 0.067 \end{pmatrix}.
\]

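3. Compute the row totals of P and store these values in a vector R; then compute the column totals of P, and store these values in a vector C. These two vectors contain the cross-tabulation's row and column percentages, respectively. Also, create a square matrix Dr with the square roots of the values of R on the diagonal and 0s elsewhere, and create a square matrix Dc with the square roots of the values of C on the diagonal and 0s elsewhere (entries rounded to three decimal places):
\[
R = \begin{pmatrix} 0.642 \\ 0.135 \\ 0.070 \\ 0.152 \end{pmatrix}, \quad
C = \begin{pmatrix} 0.608 \\ 0.249 \\ 0.143 \end{pmatrix},
\]
\[
D_r = \begin{pmatrix} 0.801 & 0 & 0 & 0 \\ 0 & 0.368 & 0 & 0 \\ 0 & 0 & 0.265 & 0 \\ 0 & 0 & 0 & 0.390 \end{pmatrix}, \quad
D_c = \begin{pmatrix} 0.780 & 0 & 0 \\ 0 & 0.499 & 0 \\ 0 & 0 & 0.378 \end{pmatrix}.
\]

4. Create a matrix S that is equal to the matrix product: S = Dr (P − RC′)Dc. In our example, this matrix is approximately equal to
\[
S \approx \begin{pmatrix} 0.096 & -0.028 & -0.025 \\ -0.019 & 0.012 & -0.000 \\ -0.007 & -0.001 & 0.004 \\ -0.017 & 0.002 & 0.007 \end{pmatrix}.
\]
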
5. An important theorem in linear algebra states that any matrix, square or nonsquare, can be written as S = UΣV′ , where U and V are orthonormal—that is, the columns of U are all independent (form right angles with each other when graphed), the columns of V are all independent, and the columns of U and V all have length 1—and where Σ has the same dimensions as S but only has nonzero elements on the diagonal. The product UΣV′ is called the singular value decomposition of S. It turns out that we can find U, Σ, and V for a matrix by working with matrix products, eigenvalues, and unit eigenvectors. To find the singular value decomposition of S, follow these steps:
• Calculate SS′. In our example, SS′ is a (4 × 4) matrix. [The numerical entries are not recoverable from the source.]

• Find the eigenvalues and unit eigenvectors of SS′. Unless SS′ is a (2×2) or a (3×3) matrix, it is easier to use a computer to find eigenvalues. In this case, SS′ has four eigenvalues, λ1, λ2, λ3, and λ4, of which the last two are equal to 0,14 and a (4 × 4) matrix of unit eigenvectors. [The values of λ1 and λ2 and the entries of the eigenvector matrix are not recoverable from the source.]

• U is equal to the matrix of unit eigenvectors of SS′, and Σ is a matrix with the same dimensions as S where the only nonzero elements are on the diagonal. Replace the diagonal elements with as many eigenvalues as will fit on the diagonal, taking the square root of each eigenvalue:
\[
\Sigma = \begin{pmatrix} \sqrt{\lambda_1} & 0 & 0 \\ 0 & \sqrt{\lambda_2} & 0 \\ 0 & 0 & \sqrt{\lambda_3} \\ 0 & 0 & 0 \end{pmatrix}.
\]

14 The last two eigenvalues are so small that they are rounded to 0 in this presentation.

• Calculate S′S. In our example, S′S is a (3 × 3) matrix. [The numerical entries are not recoverable from the source.]

• Find the eigenvalues and unit eigenvectors of S′S. The first several eigenvalues of S′S will be the same as the eigenvalues of SS′, but there can only be as many eigenvalues as columns of S′S. V is equal to the matrix of unit eigenvectors of S′S, which in our example is a (3 × 3) matrix. [The numerical entries of V are not recoverable from the source.]

We have now calculated the singular value decomposition of S.

6. To find the coordinates of the CA for the categories that constitute the rows of the cross-tabulation (constitutional reform in our example), compute the matrix product Dr UΣ. In our example, this product is a (4 × 3) matrix with one row for each category of constitutional reform. [The numerical entries are not recoverable from the source.] The first column is the x-coordinate for each category, as illustrated in Figure 10.3; the second column contains the y-coordinates should we desire to derive two latent indices from these data; and the third column contains the z-coordinates should we want to create a three-dimensional graph. It makes sense that the z-coordinates are indistinguishable from zero, because we don't have enough information with the two categorical variables to calculate the three latent variables. We can multiply these coordinates by −1, as that simply changes the way we present these results. Right now, the first latent variable represents the weakness of regional governments, which we can see because low levels of the two variables receive the highest values. By multiplying the index by −1, we instead describe the latent strength of regional governments.

7. To find the coordinates of the CA for the categories that constitute the columns of the cross-tabulation (fiscal control in our example), compute the matrix product Dc VΣ. If necessary, exclude the rows of all 0s from Σ to make the product conformable. In our example, this product is a (3 × 3) matrix with one row for each category of fiscal control. [The numerical entries are not recoverable from the source.] Again, the first column contains the x-coordinates of the categories of fiscal control, as illustrated in Figure 10.3, and the second and third columns contain the y- and z-coordinates if we want to use them.
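
Steps 1 through 7 can be traced in code. The sketch below assumes Python with NumPy and follows the steps exactly as they are stated in this section, with the square roots of R and C on the diagonals of Dr and Dc. Other presentations of CA scale by the inverse square roots instead, so treat the output as an illustration of these particular steps rather than as a reference implementation. It also relies on NumPy's built-in singular value decomposition instead of separately finding the eigenvectors of SS′ and S′S; because the sign of each singular vector is arbitrary, coordinates may come out multiplied by −1.

```python
import numpy as np

# Step 1: the cross-tabulation as a matrix.
M = np.array([
    [1752, 290,  28],
    [  54, 326,  56],
    [  33,  34, 160],
    [ 121, 153, 217],
], dtype=float)

# Step 2: divide by the grand total.
P = M / M.sum()

# Step 3: row and column totals, and the diagonal matrices of square roots.
R = P.sum(axis=1)
C = P.sum(axis=0)
Dr = np.diag(np.sqrt(R))
Dc = np.diag(np.sqrt(C))

# Step 4: S = Dr (P - RC') Dc.
S = Dr @ (P - np.outer(R, C)) @ Dc

# Step 5: singular value decomposition S = U Sigma V'.
U, s, Vt = np.linalg.svd(S, full_matrices=True)
V = Vt.T
Sigma = np.zeros_like(S)            # same dimensions as S (4 x 3)
Sigma[:len(s), :len(s)] = np.diag(s)

# The squared singular values are the eigenvalues of S'S.
eig_StS = np.sort(np.linalg.eigvalsh(S.T @ S))[::-1]

# Step 6: coordinates for the row categories (constitutional reform).
row_coords = Dr @ U @ Sigma

# Step 7: coordinates for the column categories (fiscal control),
# dropping the all-zero last row of Sigma so the product is conformable.
col_coords = Dc @ V @ Sigma[:3, :]

print(np.round(row_coords, 3))
print(np.round(col_coords, 3))
```

The first column of `row_coords` and of `col_coords` holds the one-dimensional coordinates; the last column is numerically zero, which matches the remark about the z-coordinates above.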
[Figure 10.4 Multiple Correspondence Analysis Coordinates for the Categories of the Regional Authority Indices. A two-dimensional map with "Self Rule" on the horizontal axis and "Shared Rule" on the vertical axis; each point is a category of one index, marked as Low, Medium-Low, Medium, Medium-High, or High.]
Note: ID = Institutional depth, PS = Policy scope, FA = Fiscal autonomy, R = Representation, LM = Lawmaking, EC = Executive control, FC = Fiscal control, CR = Constitutional reform.
[Figure 10.5 Separate Multiple Correspondence Analysis Coordinates for the Categories of Each of the Eight Regional Authority Indices. Eight panels (Institutional Depth, Policy Scope, Fiscal Autonomy, Representation, Law Making, Executive Control, Fiscal Control, and Constitutional Reform), each with "Self Rule" on the horizontal axis and "Shared Rule" on the vertical axis, and with categories marked as Low, Medium-Low, Medium, Medium-High, or High.]
Note: ID = Institutional depth, PS = Policy scope, FA = Fiscal autonomy, R = Representation, LM = Lawmaking, EC = Executive control, FC = Fiscal control, CR = Constitutional reform.

To plot more than two categorical variables on the same map, researchers have developed a technique called multiple correspondence analysis (MCA) which is similar in spirit to multiple regression. Instead of working with a simple cross-tabulation, MCA builds a symmetric matrix, where the rows and columns represent every category that appears with every variable in the analysis, and the cells contain counts of the number of observations that share each pair of categories. This matrix is similar in spirit to a covariance matrix, so MCA is very similar to PCA. Without delving into the explicit mathematical construction of MCA, let’s apply it to the whole regional authority data set. If we pull out two dimensions, we have to decide what these dimensions mean. Let’s call the first dimension “self-rule” and the second dimension “shared rule.” The map of these coordinates for every category of each index is shown in Figure 10.4. To help you understand these results, Figure 10.5 contains eight maps that plot the coordinates for the indices one at a time. Most of the variables are ordered quite well on the dimension of self-rule, where higher degrees of influence match higher levels of self-rule. Shared rule is a weaker story. Most of the variables are not well ordered on this dimension.
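
To make the symmetric matrix that MCA works with more concrete, here is a minimal sketch of how such a matrix of co-occurrence counts can be built. It assumes Python with NumPy; the six observations are invented toy data, not the regional authority data, and the helper `indicator` is a hypothetical name of my own. The resulting matrix is commonly called a Burt matrix.

```python
import numpy as np

# Hypothetical toy data: six observations of two categorical variables.
fc = ["low", "low", "medium", "high", "medium", "low"]
cr = ["low", "medium", "medium", "high", "high", "low"]

def indicator(values):
    """Return the sorted categories and a 0/1 matrix with one column per category."""
    cats = sorted(set(values))
    Z = np.array([[1 if v == c else 0 for c in cats] for v in values])
    return cats, Z

fc_cats, Z_fc = indicator(fc)
cr_cats, Z_cr = indicator(cr)

# Stack the indicator columns for every variable side by side.
Z = np.hstack([Z_fc, Z_cr])
labels = [f"FC:{c}" for c in fc_cats] + [f"CR:{c}" for c in cr_cats]

# Each cell of B counts the observations that share a pair of categories.
B = Z.T @ Z

print(labels)
print(B)
```

The diagonal of `B` holds the number of observations in each category, and the off-diagonal block that pairs the FC columns with the CR columns is the ordinary cross-tabulation of the two variables.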
Exercises

1. Solve the following systems of equations by writing them as matrix equations and taking the inverse of the coefficient matrix: (a)–(d) [The coefficients and constants of these four systems are not recoverable from the source.]

2. Solve the following systems of equations by reducing the augmented matrix: (a)–(d) [The coefficients and constants of these four systems are not recoverable from the source.]

3. Determine whether the following systems have one solution, no solution, or infinitely many solutions: (a)–(d) [The coefficients and constants of these four systems are not recoverable from the source.]

4. The following systems of equations have infinitely many solutions. Find a formula for the solutions of the systems, and report the specific solution of each system. (a)–(d) [The coefficients and constants of these four systems are not recoverable from the source.]

5. Find the eigenvalues of the following matrices, and determine whether each matrix is positive-definite, negative-definite, or neither. Also, find the unit eigenvectors associated with each nonzero eigenvalue. (a)–(j) [The entries of these ten matrices are not recoverable from the source.]

6. Consider the multivariate function f(x,y). [The coefficients and exponents in the formula for f(x,y) are not recoverable from the source.]
(a) Find the gradient of f(x,y).
(b) Find the Hessian of f(x,y).
(c) Prove that the critical points of f(x,y) are the two points stated in the exercise. [The coordinates of the two critical points are not recoverable from the source.]
(d) Plug each critical point into the Hessian, and derive the eigenvalues of the resulting matrix. Use this information to decide whether each critical point represents a local maximum of the function, a local minimum, or a saddle point.

7. For each of the matrices A, B, C, D, E, and F [the entries of these matrices are not recoverable from the source],
(a) report the following quantities:
• the trace,
• the determinant,
• the eigenvalues.
(b) What is the relationship between the trace and the eigenvalues of a matrix and between the determinant and the eigenvalues of a matrix?

8. Consider a (4 × 4) matrix A with eigenvalues λ1, λ2, λ3, and λ4, where λ1 is positive and the other three eigenvalues are negative. [The entries of A and the values of the eigenvalues are not recoverable from the source.] The matrix of unit eigenvectors for this matrix is Q, where the first column is the unit eigenvector that corresponds to λ1, the second column is the unit eigenvector that corresponds to λ2, and so on. The exercise also provides the inverse of this matrix, Q⁻¹. [The entries of Q and Q⁻¹ are not recoverable from the source.]

A theorem in linear algebra states that if A is a square matrix of full rank, then A = QBQ⁻¹, where Q is the matrix of unit eigenvectors and B is a square matrix in which the diagonal elements are the eigenvalues of A and the off-diagonal elements are all 0. A square, nonsingular matrix written in this way is called an eigendecomposition. Eigendecompositions are very useful for simplifying other calculations, such as finding the inverse of a matrix15 or a higher power of a matrix.16 For matrix A above, prove that A = QBQ⁻¹.

9. The concept of social capital within a population was defined by Robert Putnam as the "collective value of all 'social networks' and the inclinations that arise from these networks [for people] to do things for each other."17 Sociologists and economists have found that social capital has a profound effect on many societal outcomes. For example, Stephen Knack and Philip Keefer show that higher levels of social capital are associated with higher incomes in a country and greater levels of wage equality.18 They measure social capital with two observed indices. First, they measure trust with the following survey question:

Generally speaking, would you say that most people can be trusted or that you can't be too careful in dealing with people?

15 A corollary of the eigendecomposition theorem states that A⁻¹ = QB⁻¹Q⁻¹. That is, if we know the eigenvalues of A along with the matrix of unit eigenvectors and the inverse of the matrix of unit eigenvectors, all we need to do to calculate A⁻¹ is find the inverse of B. It turns out to be really easy to calculate B⁻¹: Since this matrix is zero everywhere except the diagonal, the diagonal elements of B⁻¹ are just the reciprocals of the diagonal elements of B.
16 Another corollary of the eigendecomposition theorem states that Aⁿ = QBⁿQ⁻¹. It's easy to calculate Bⁿ: Since this matrix is zero everywhere except the diagonal, the diagonal elements of Bⁿ are just the diagonal elements of B each taken to the power of n.
17 Putnam, R. (2000). Bowling alone: The collapse and revival of American community. New York, NY: Simon & Schuster.
18 Knack, S., & Keefer, P. (1997). Does social capital have an economic payoff? A cross-country investigation. Quarterly Journal of Economics, 112(4), pp. 1251–1288.
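
The eigendecomposition theorem and its two corollaries can be checked numerically on a small example. The sketch below assumes Python with NumPy and uses a hypothetical (2 × 2) matrix of my own choosing, not the matrix from Exercise 8.

```python
import numpy as np

# A hypothetical square matrix of full rank, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Q holds the unit eigenvectors as columns; B has the eigenvalues on its diagonal.
eigenvalues, Q = np.linalg.eig(A)
B = np.diag(eigenvalues)
Q_inv = np.linalg.inv(Q)

# Theorem: A = Q B Q^{-1}.
A_rebuilt = Q @ B @ Q_inv

# Corollary: A^{-1} = Q B^{-1} Q^{-1}, where B^{-1} holds the reciprocals.
A_inverse = Q @ np.diag(1.0 / eigenvalues) @ Q_inv

# Corollary: A^n = Q B^n Q^{-1}, here with n = 3.
A_cubed = Q @ np.diag(eigenvalues ** 3) @ Q_inv

print(np.round(A_rebuilt, 6))
print(np.round(A_inverse, 6))
print(np.round(A_cubed, 6))
```

Comparing these results with `np.linalg.inv(A)` and `np.linalg.matrix_power(A, 3)` is a quick way to confirm the corollaries for any matrix you try.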
The second observed index that Knack and Keefer use to measure social capital is civic cooperation, which they build from responses to questions about whether each of the following behaviors "can always be justified, never be justified or something in between."

(a) "claiming government benefits which you are not entitled to"
(b) "avoiding a fare on public transport"
(c) "cheating on taxes if you have the chance"
(d) "keeping money that you have found"
(e) "failing to report damage you've done accidentally to a parked vehicle"

Suppose that we conduct a survey and that we use Knack and Keefer's indices to measure social capital with a PCA. Suppose further that we've measured each index on a 1–10 scale where higher numbers indicate more trust and more civic cooperation. We survey 10 people and acquire the following data:

Observation Number    Trust    Civic Cooperation
1                     4        2
2                     7        10
3                     3        6
4                     8        9
5                     9        7
6                     2        3
7                     9        6
8                     5        3
9                     6        4
10                    8        8

(a) Describe the characterization, representation, and measurement in this measurement model.
(b) The exercise provides the (2 × 2) covariance matrix of the data. [The numerical entries are not recoverable from the source.] Find the eigenvalues and unit eigenvectors of the covariance matrix.
(c) How much of the variance is explained by the larger eigenvalue?
(d) Use the results of the PCA to create an index for social capital.

10. We will conduct a CA for the following cross-tabulation of data from a health behavior survey:

                                      Regular Smoker?
Drinking                              No       Yes
Fewer than three drinks per week      452      174
Three or more drinks per week         82       292

But we will proceed in steps. Please refer to the steps outlined in Section 10.5.2 for definitions of the notation in the following steps.
(a) Write the cross-tabulation as a matrix, and calculate the matrix P.
(b) Calculate R, C, Dr, and Dc.
(c) Calculate S = Dr (P − RC′)Dc.
(d) Calculate SS′ and S′S.
(e) Calculate the eigenvalues of SS′, and calculate Σ.
(f) The exercise provides the (2 × 2) matrix U of unit eigenvectors of SS′ and the (2 × 2) matrix V of unit eigenvectors of S′S. [The numerical entries of U and V are not recoverable from the source.] Given these matrices and your calculation for Σ, prove that UΣV′ comprises the singular value decomposition of S.
(g) Calculate the coordinates for the first and second latent variables that underlie this cross-tabulation.
(h) Create a plot of the points you calculated in (g). Which category is most unlike all of the others?
Conclusion: Taking the Math With You As You Proceed Through Your Program

If you are a new student in a social science graduate program, you are using this book to shore up your math skills in preparation for advanced courses in game theory or statistics. For your graduate research, you are going to spend countless days, months, and years researching a specific topic in the social sciences and developing theories about the way the world works. You will also spend countless days, months, and years collecting real-world observations that inform you about the topic. You will hold yourself to high standards regarding the quality of theory and data, and you would be wise to hold yourself to equally high standards in designing statistical models to fit the data and test the theories.

The difference between using statistical models and being in control of statistical models is the ability to engage with the mathematical underpinnings of the models. It isn't necessary to take a semester-long course to simply use statistical models: In a day or two, you would be able to teach yourself about the most commonly used commands in your favorite statistical software package, and you would be able to run many different kinds of models by following the examples in the help documentation. But if you take this approach, you will have to put a lot of faith in these programs, which were written without any regard for your own research project and its nuances. You might not be able to recognize the problems that arise, and you might not know how to fix them if you do. But being familiar with the math that underlies the models will help you understand how the models really work. Formulas and prescriptions will seem less arbitrary. You'll become more confident as a researcher. You'll be able to design your own model from the same mathematical ingredients and make a confident case that this method is the best one for your purpose.

Below, I list several courses that are frequently taught in various social science departments. Each course builds on a mathematical foundation, but they use and emphasize different topics. Before starting on one of these courses, it will be useful to review the most relevant mathematics topics.

General Mathematical Notations. Every one of these courses will engage with literature that presents ideas using various mathematical notations. Regardless of the course you will be taking, a review of summations and long products (Section 1.6), set builder and interval notation (Sections 2.1 and 2.2), Newton's and Leibniz's notations for derivatives (Section 4.7), integral notation (Section 6.3), the notation for partial derivatives (Section 7.3.1) and multiple integrals (Section 7.4.1), and matrix notation (Sections 8.1–8.4) will prepare you to read the texts and articles. You will also frequently see Euler's constant e (Section 4.4) used in formulas without an explanation that this number is irrational, is about 2.71828, and is the most important number in all of science.

Probability. The first course in quantitative methodology in many programs is a course on probability. In this course, you will learn about probability functions and Bayes' rule (Chapter 3) as well as hypothesis tests such as the Student's t, z, F, and χ² tests. These tests are built on probability density functions (PDFs) (Section 6.6). You will also learn the connection between density and probability, and you might be expected to calculate an integral of a PDF to compute a CDF (cumulative distribution function). You will also learn about moments such as the mean and the variance (Section 6.7). In some courses, you will also learn about joint, marginal, and conditional PDFs, along with covariance and correlation (Section 7.4.4).

Linear Regression and Analysis of Variance. Commonly, social science graduate students will take a course in their second semester that focuses exclusively on either linear regression (Section 2.5.3) or analysis of variance (ANOVA) methods. It turns out that these two methods, by and large, are equivalent, and they present the same information in different ways.19 For both approaches, you will need to have a solid foundation in probability and moments. Regression models are often best understood in terms of a marginal effect: the first partial derivative (Sections 4.6–4.9, 7.3.1, and 7.3.2) of the expected value (Section 6.7) of the outcome. Some regression models require that the outcome or an independent variable, or both, be transformed with a logarithm (Section 1.5). To understand the formulas that are used to calculate coefficient estimates (Section 7.3.4), it will be necessary to have a firm understanding of summations and multivariate optimization (Section 7.3.3). Regression coefficients cannot be estimated when variables are perfectly collinear, a concept that is the same as linear dependency, which causes a data matrix to be singular and of less than full column rank (Section 9.4).
Finally, to understand the matrix version of the OLS (ordinary least squares) estimator (Section 9.3), you will need to understand basic matrix operations such as the transpose (Section 8.3), matrix multiplication (Section 8.4), and matrix inverses (Sections 9.1 and 9.2).
19 For a detailed discussion of why these two methods are so closely related, see Gelman, A. (2005). Analysis of variance: Why it’s more important than ever. The Annals of Statistics, 33(1), 1–53.
Generalized Linear Models or Maximum Likelihood Estimation. The course on generalized linear models (GLM) or maximum likelihood estimation (MLE) usually follows a course in linear regression. A GLM is an extension of the linear regression model and is used to analyze the variance of limited dependent variables, such as binary, ordinal, unordered categorical, count, truncated, and duration outcomes. A GLM is estimated using a process called MLE by writing out a likelihood function for the data. A likelihood function is related to a joint PDF (Section 7.4.4) for the dependent variable, and is usually written as the product of the PDFs for each observation in the data using a long product (Section 1.6). Researchers usually find it easier to work with the log-likelihood function: the natural logarithm (Section 1.5) of the likelihood function. To obtain results, you will find the values of parameters that maximize the log-likelihood function (Sections 5.1, 5.2, and 7.3.3). You will probably be asked to perform some of these calculations by hand, but in practice researchers rely on computer hill-climbing algorithms such as the Newton-Raphson method (Section 5.3) to optimize a complex log-likelihood function. As with linear regressions, the results of GLMs are best understood as marginal effects: the partial derivative (Sections 4.6–4.9, 7.3.1, and 7.3.2) of some component of the model with respect to a variable of interest.
Time-Series, Time-Series Cross Section Methodology, and Multilevel Models. These important topics are sometimes taught in one seminar, and sometimes in different seminars. Time series usually refers to data for which the units of observation are many consecutive points in time. Some important time series in the social sciences are daily stock market indices, monthly presidential approval, and weekly crime statistics. Time-Series Cross Section (TSCS) methods, also called pooled time series and panel methods, work with data that contain a panel that is observed at many consecutive time points. Longitudinal studies and wave studies, as well as data sets that contain observations of countries over time or firms during a time span, are also analyzed with TSCS methods. Multilevel models, also called hierarchical linear models, do not necessarily contain time, but do contain nested levels of identification. A classic example of multilevel data considers students within classrooms within schools, in which a theory posits that factors that vary across students, across classrooms, and across schools can all affect a student’s test scores. All three topics build on the topics covered in linear regression and GLM courses. In addition, you will benefit by reviewing limits (Sections 4.1–4.3) to help you understand the long-term behavior of a time-series model. Some time series are integrated, meaning that they contain the summation of all shocks to the system since the beginning of the series, and it will be useful to review the connection between summations and integrals (Sections 6.1–6.3). Certain TSCS and multilevel models employ terms called random effects, which must be integrated out (Sections 6.4 and 6.5) of a likelihood function to derive coefficient estimates.
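
The idea of an integrated series as a running sum of shocks can be made concrete with a short simulation. This is only a sketch, assuming NumPy; the series length, the seed, and the choice of standard normal shocks are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
shocks = rng.normal(size=200)    # hypothetical shocks e_1, ..., e_T

# Integrated series: y_t = e_1 + e_2 + ... + e_t
y = np.cumsum(shocks)

# First-differencing undoes the summation: y_t - y_(t-1) = e_t,
# taking the value before the first observation to be 0.
recovered = np.diff(y, prepend=0.0)

print(np.allclose(recovered, shocks))
```

Each value of y carries every shock that came before it, which is why shocks to an integrated series do not fade over time. Taking first differences returns the original shocks.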
One way to model a time series is as a dynamical system, whose long-term behavior is best understood by calculating eigenvalues and eigenvectors (Section 10.4).
Bayesian Statistics. Bayesian statistics is an alternative approach to statistical inference that contrasts with the frequentist focus of linear regression and GLM, although regressions and GLMs can be implemented as Bayesian models. To understand this approach, you will need a firm understanding of probability and Bayes’ rule (Chapter 3). A great deal of the focus of a course in Bayesian statistics is on the estimation of a Bayesian model using a computer algorithm called Markov chain Monte Carlo (MCMC) simulation, which is related to the concept of an integrated time series. The goal of MCMC is to express the entire joint probability distribution (Section 7.4.4) of all of the parameters in the model. One version of MCMC, called a Gibbs sampler, repeatedly draws simulated values of each parameter from each parameter’s conditional distribution. A second approach, called Metropolis-Hastings, uses a second PDF to propose new values for the parameters.
Measurement. Statistical measurement considers variables that are latent, that is, variables that exist but cannot be directly observed, though they have implications that are observable. A latent variable is measured from its observable implications. An especially important topic in measurement is eigenvalues and eigenvectors (Section 10.4). Some measurement techniques, such as principal components analysis (Section 10.5.1), involve little more than the calculation of the eigenvalues of a covariance matrix (Section 7.4.4). Many measurement methods are estimated through simulation techniques: Frequentist models are often estimated using the expectation-maximization (EM) algorithm in which structural parameters are estimated with MLE given a particular measurement of the latent variable, after which the latent variable is measured with the latest likelihood. Bayesian measurement models estimate structural parameters and the latent variable simultaneously using MCMC.
Game Theory. Game theory is the study of how actors make decisions rationally in a context where an actor’s payoff for an action depends on the actions of other actors. When no actor has an incentive to depart from a particular strategy, that strategy is called a Nash equilibrium, and if the assumptions of the game are not too restrictive or unrealistic, then a Nash equilibrium may describe reality. Unlike all of the topics discussed above, game theoretical models do not require any data. To construct a game and derive an equilibrium, first you will write down who the actors are and what their possible actions are. You will then write down utility functions for each actor: these functions (Sections 2.4 and 7.1) take as inputs the set of possible actions for every actor and map those actions to utilities. Your goal is to determine the action that has the maximum utility for the actor. To maximize a utility function, you will take the derivative of that function (Sections 4.6–4.9) and apply the rules for finding a global maximum (Chapter 5). Some games with greater complexity may consider situations where one actor must make more than one decision, requiring multivariate optimization (Sections 7.3.1–7.3.3), or may optimize utility under constraints (Section 7.3.5), or under uncertainty, requiring Bayes’ rule to represent beliefs.
Network Theory. Many social processes and phenomena can be represented as networks. Networks are visual and consist of dots called nodes or vertices, which may or may not be connected to other vertices with straight lines called links or edges. The goal of a network model is to represent connection.
In the social sciences, networks can be used to represent the financial connectivity between firms and banks, cosponsorships on bills in Congress, the spread of a contagious disease in a population, and so on. One important concept in network theory is centrality, measured by computing the eigenvalues (Section 10.4) of a matrix that represents all of the connections in the network. Another important concept is diffusion, the speed with which information or contagion moves through the network. Since diffusion is a speed, we measure it with derivatives (Sections 4.5–4.9, 7.3.1, and 7.3.2). To conduct statistical inference on the structure of the network, you will employ an exponential random graph model (ERGM) that is similar to a GLM and builds on joint PDFs (Sections 6.6 and 7.4.4) but relaxes the assumption made by MLE that observations are conditionally independent.
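
As an illustration of how a centrality score can come out of an eigenvalue problem, the sketch below computes one common variant, eigenvector centrality, for a small network. It assumes NumPy; the four-node network and the adjacency matrix A are hypothetical, and normalizing the scores to sum to 1 is a convention chosen here, not a requirement.

```python
import numpy as np

# Hypothetical undirected network with 4 nodes:
# node 0 is linked to every other node, and nodes 1 and 2 are linked to each other.
A = np.array([
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so the last column is the eigenvector for the largest eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(A)
leading = eigenvectors[:, -1]

# The sign of an eigenvector is arbitrary, so take absolute values,
# then rescale so the scores sum to 1.
centrality = np.abs(leading) / np.abs(leading).sum()
most_central = int(np.argmax(centrality))

print(centrality)
print(most_central)
```

In this network the node with the most links receives the highest score, and the two nodes in symmetric positions receive equal scores. A node's score depends not only on how many links it has but also on how central its neighbors are.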
Index
Addends, 5 Addition, 6–8, 277 Adjoint matrix, 304–307 Algebra, review of, 3–39 equations and inequalities, 25–34 exponents, 10–11 fractions, 6–10 logarithms, 14–17 numbers, 3–6 roots, 11–13 summations and products, 17–25 Algorithms, convergence, 154–155 α parameter, 312 Analysis of variance (ANOVA), 380 Antiderivatives, 161–162. See also Integration Arithmetic, 4–5. See also Matrices Asymptotes, continuity and, 117–119 Augmented matrix, 332, 354–355, 357 Axes on graphs, 55 Basis, vectors forming, 322n Bayesian statistics, 381 Bayes’ rule, 77, 95–99 Beliefs, quantifying, 97 Best-fit line: in linear regression, 66, 68–70 in multiple regression, 311 partial derivatives and, 226–233 β parameter, 312 Binomial distribution, 91–92 Binomials (combinations), 88–89 Bivariate standard normal probability density function (PDF), 207–208 Block diagonal matrices, 275 Bounds, 233, 238–240
Calculus. See Derivatives; Integration; Limits; Multivariate calculus Cases in data sets, 63 CDF (cumulative distribution function), 185–187, 247–248 Chain rule for derivatives, 134–141 Characteristic equation of matrix, 347, 352 Characterization of variables, 358–359 Clay Mathematics Institute, 164n CMF, joint (joint cumulative mass function), 248 Coefficients in polynomials, 58 Cofactor matrix, 306–307 Collinear variables, 319 Column number of matrix elements, 272 Column vectors, 219, 273 Combinations (binomials), 88–89 Combining like terms, 26–27 Common logarithms, 14–15 Commutativity property: long products, 22–23 matrix multiplication and, 281 summations, 19–20 Comparative statics, 123–124 Complement events, 85–86 Complements of sets, 42 Composite numbers, 5 Compositions of functions. See Chain rule Conditional distributions, 251–253 Conditionally independent events, 83 Conditional probability, 92–94 Conformability of matrices, 282–283 Constants, 6, 58
Constrained optimization, 233–234 Continuity and asymptotes, 117–119, 212 Continuous functions, differentiable, 126–128 Continuous random variables, 182 Continuous sets, 78 Controlling for variables in regression, 71–72 Convergence algorithm, 154–155 Correlation, 257–262 Correspondence analysis, 362–371 cross-tabulation in, 363–364 eigenvalues and eigenvectors in, 366–369 mapping in, 364–366 multiple, 371 multiple coordinates in, 369–370 overview, 359 Correspondence between sets, 51 Correspondence of functions, 206 Cosines, 291–292 Counting theory, 87–89 Covariance (Cov): definition of, 194–195 joint probability distributions and moments, 253–257 Critical points: maxima and minima, 151–153 multivariate function, 222, 225, 228 Newton-Raphson method for, 156 parabola, 62 Cross-tabulation, 363–364 Cube roots, 11–12 Cubic functions, 62 Cumulative distribution function (CDF), 185–187, 247–248 Cumulative mass function, joint (joint CMF), 248 Curvilinear effect in regression, 70–71 Data points, 64 Data sets for linear regression, 63–65 Definite integral, 169, 173–175 Definite multiple integrals, 241–244 Degree of polynomials, 58
Denominator of fractions, 6–7 Derivatives: chain rule for, 134–141 definition of, 124–128 notation for, 129–130 shortcuts for finding, 131–133 See also Limits; Optimization; Partial derivatives Determinants: definition of, 304 larger square matrix, 308–311 Diagonal matrices, 274 Difference-of-cubes formula, 30 Difference-of-squares formula, 29 Differentiable continuous functions, 126 Dimensions of matrices, 272 Discrete random variables, 182 Discrete sets, 78 Distributions: cumulative distribution function, 185–187 exponent, 24 exponential function of base e as normal, 122–123 factoring and, 27–29 joint multivariate normal, 247 posterior, 97 uniform across outcomes, 80 See also Joint probability distributions and moments; Moments of probability distributions Divided exponents, 12 Division, 9–11. See also Matrix inverses Domain and range, 56–58 Domains, bounded, 150–151 Dot product of two vectors, 280 DW-NOMINATE software, 359 e (natural logarithm base): derivatives of exponential functions with base, 133 description of, 14–15 Euler’s constant, 379 normal distribution as exponential function of, 122–123
Eigenvalues: description of, 294 finding, 347–349 in correspondence analysis, 366–369 overview, 345–346 positive-definite and negative-definite matrices, 350–354 Eigenvectors: description of, 294 finding, 354–357 in correspondence analysis, 366–369 overview, 345–346 positive-definite and negative-definite matrices, 350–354 Elementary row and column operations on matrices: eigenvectors found by, 354–355 linear systems of equations solved by, 332–335 steps in, 295–297 Elements: matrix, 219, 272 set, 41 EM (expectation-maximization) algorithm, 381–382 Empty sets, 41 Equally likely outcomes, probability of, 79–81 Equations and inequalities: distribution and factoring, 27–29 inequalities, 32–34 isolating variables, 25–27 quadratic equations, 31–32 See also Linear systems of equations and eigenvalues ERGM (exponential random graph model), 382 Euler, Leonhard, 122–123, 379 Even numbers, 4 Events: complement, 85–86 conditionally independent, 83 independent, 83–85 multiplication of stages in, 87 sample spaces and, 77–78
unions of, 82–83 Expectation-maximization (EM) algorithm, 381–382 Expected values, 188–193, 253 Exponential functions: derivatives of, 133 integration of, 171 logarithms for, 14–16 Exponential random graph model (ERGM), 382 Exponentiation property of long products, 23 Exponents: degree of polynomials as, 58 distribution of, 24 divided, 12 fraction division by, 9 nondistribution of, 21–22 review of, 10–11 sum of, 24–25 Factorials, 87–88 Factoring: canceling terms as, 8 distribution and, 27–29 fraction division by, 9 long products, 24 summations, 20–21 False-positive rates, 96 Finite sets, 42 First derivative test, 151–152 First fundamental theorem of calculus, 170, 175 First partial derivative, 215–216 FOIL problems, 29–30 Fractions, 6–10, 56 Free variables, 338 Frequentist models, 77, 381 Full column rank matrices, 322 Full row rank matrices, 322 Function compositions, 52–54 Function inverse, 53–54 Functions. See Derivatives; Limits; Sets and functions
Game theory, 148, 382 Gauss, Carl Friedrich, 164 Gaussian quadrature, 164 Gauss-Jordan elimination operations, 295–297 Generalized linear models (GLM), 380–381 General mathematical notation, 379 Gibbs sampler, 381 Global maxima and minima, 150, 234–237 Gradients and Hessians, 218–224 Graphs: intervals on, 47–48 linear functions in, 59–60 sets and functions as, 55–56 Harlan, John Marshall, Jr., 358 Hessians, gradients and, 218–224 Higher-order polynomials, 61–62, 70–72 Hofert, Marius, 250n Homogenous systems of linear equations, 342–344 Horizontal asymptotes, 118–119 Hyperspheres, 206n Identity matrices, 274–275 IID (independent and identically distributed) observations, 196 Improper fractions, 7 Improper integrals, 180–181 Indefinite integrals, 161, 170–173 Indefinite multiple integrals, 244–245 Independent and identically distributed (IID) observations, 196 Independent events, 83–85 Index variables, 17 Indirect variables. See Statistical measurement models Inequalities. See Equations and inequalities Infinite sets, 43, 45 Infinitesimal values, 168 Infinite values, 115 Inner product to multiply vectors, 280 Instantaneous rate of change, 125
Integers, 3–4 Integrands, 169 Integration, 161–204 definitions of, 161–163 moments, 188–196 expected values, 189–193 first, 188–189 standard deviation derivation, 195–196 variance derivation, 193–195 notation for, 168–169 probability density functions, 182–187 Riemann sums, 163–168 solving, 170–176 definite, 173–175 improper integrals, 180–181 indefinite, 170–173 integration by parts, 178–180 overview, 170 u-substitution, 176–178 See also Multivariate calculus Intersection of sets, 42 Intervals: 95 percent confidence, 187, 316–317 set and function, 45–48 Inverses: function compositions and, 52–54 operations with, 11 See also Matrix inverses Irrational numbers, 3 Iterations, 154 Jacobellis v. Ohio (1964), 357 “Jailbreak method,” for radicals, 13 Joint cumulative mass functions (joint CMFs), 248 Joint multivariate normal distribution, 247 Joint probability distributions and moments: conditional distributions, 251–253 correlation of, 257–262 covariance of, 253–257 cumulative distribution function, 247–248
definition of, 245–246 expected values of, 253 marginal distributions, 248–251 normal distribution, 246–247 Joint probability mass functions (joint PMFs), 248 Kronecker product of matrices, 278–279 Kurtosis of probability density functions, 196 Lagrange multipliers, 233–237 Larger square matrix inverses: adjoint matrix, 305–307 determinants, 308–311 Latent variables: measuring, 364 principal component analysis and, 360–362 statistical test of association and, 362 Law of large numbers, 188 LCD (lowest common denominator), 7–8 Left factor matrix, 281 Left inverse of matrix, 303 Left limits of functions, 112 Left Riemann sum, 164–166 Leibniz, Gottfried Wilhelm, 129–130, 134, 214, 379 Like terms, combining, 26–27 Limits: continuity and asymptotes, 117–119 definition of, 111–117 multivariate calculus, 208–212 number e, 122–123 point estimates and comparative statics, 123–124 solving, 119–121 See also Derivatives Linear algebra. See Matrices Linear dependency of matrices, 319–320 Linear functions, 59–60 Linear regression, 63–72 best-fit line from, 68–70 data sets for, 63–65 higher-order polynomial models, 70–72
partial derivatives and, 226–233 residuals from, 67–68 scatterplots from, 65–66 social science sources on, 380 Linear systems of equations and eigenvalues, 329–377 eigenvalues and eigenvectors, 345–356 homogenous systems, 342–344 nonsingular coefficient matrices, 330–335 overview, 329–330 singular coefficient matrices, 335–342 statistical measurement models, 357–371. See also entries for individual models Logarithms: derivatives of, 171–172 derivatives of natural, 133 domains of, 57 review of, 14–17 Long products, 22–25 Lower bounds of intervals, 46 Lowest common denominator (LCD), 7–8 Lowest terms of fractions, 6 Mapping: correspondence analysis, 364–366 functions as, 51–52 multivariate functions as, 206–207 Marginal distributions, 248–251 Markov chain Monte Carlo (MCMC) simulation, 381 Marshall, Thurgood, 358 Matrices, 271–302 arithmetic operations on, 276–281 definition of, 218–219 elementary row and column operations on, 295–297 Hessians as, 219–224 linear dependency of, 319–320 multiplication of, 281–285 nonsingular coefficient, 330–335 notation for, 271–273
positive-definite and negative-definite, 222–223, 350–354 rank of, 320–323 singular coefficient, 335–342 singularity of, 318–319 types of, 273–275 vectors and transformation, 285–294 See also Linear systems of equations and eigenvalues Matrix inverses, 303–317 larger square matrix, 305–311 multiple regression and OLS estimator, 311–317 2 × 2 matrix, 303–305 Maxima and minima: finding, 151–153 global, 150, 234–237 Lagrange multipliers for, 234–237 maximum likelihood estimation (MLE) methods, 147–148 multivariate function, 222 Newton-Raphson method, 154–157 terminology, 148–151 Maximum likelihood estimation (MLE) methods, 147–148, 154, 380–381 MCA (multiple correspondence analysis), 371 MCMC (Markov chain Monte Carlo) simulation, 381 Mean, first moment of probability function as, 188 Measurement of variables, 358–359, 381–382 Method of moments. See Moments of probability distributions Midpoint Riemann sum, 164–165, 167 Millennium problems, 164n Minima. See Maxima and minima Minor elements in adjoint matrix, 305–306 Missing data points, 64 Mixed numbers, 7 MLE (maximum likelihood estimation) methods, 147–148, 154, 380–381
Moments of probability distributions, 188–196 expected values, 189–193 first, 188–189 standard deviation derivation, 195–196 variance derivation, 193–195 See also Joint probability distributions and moments Multilevel models, 381 Multiple correspondence analysis (MCA), 371 Multiple integrals: definite, 241–244 indefinite, 244–245 joint probability distributions and moments, 245–257 notation, 237–241 Multiple regression: OLS estimator and, 311–317 overview, 71–72 Multiplication: event stages, 87 exponent, 10–11 fraction, 8–9 matrix, 281–285 Multiplication property of summations, 20 Multivariate calculus, 205–267 functions in, 206–208 limits in, 208–212 multiple integrals, 237–262 partial derivatives, 212–236 Mutually exclusive sets, 42 Nash equilibrium, 382 Natural logarithms: derivatives of, 133, 171–172 e as base of, 14–15 Natural numbers, 3 Negative-definite matrices, 222–223, 350–354 Negative exponents, 10 Negative real numbers, 3 Negative-semidefinite matrix, 352n Negative slope in linear function, 59
Network theory, 382–383 Newton, Isaac, 129–130, 214, 379 Newton-Raphson method, 154–157 95 percent confidence interval, 187, 316–317 Nondistribution property of summation exponents, 21–22 Nonsingular coefficient matrices, 330–335 Normal distribution: exponential function of base e as, 122–123 joint multivariate, 247 joint probability distributions and moments and, 246–247 nth moment of probability density functions (PDF), 192 n-tuple integral, 238 Null hypothesis, statistical significance tests on, 187n Numbers, 3–7 Number theory, 164 Numerator of fractions, 6 Numerical method, area under curve approximated by, 163–164 Observations in data sets, 63 Obviously untrue statements, 335–336 Odd numbers, 4 OLS (ordinary least squares) estimator, 231n, 311–317 Omitted variables, 71 One, exponent of, 10 Optimization, 147–160 Lagrange multipliers and, 233–234 maxima and minima, 151–153 maximum likelihood estimation (MLE) methods, 147–148 Newton-Raphson method, 154–157 partial derivatives and, 222–226 terminology, 148–151 Ordered n-tuple, 206 Ordered pairs of numbers, 206 Order of operations, in arithmetic, 4–5
Ordinary least squares (OLS) estimator, 231n, 311–317 Outcomes: equally likely, 79–81 in set theory, 77 uniform distribution across, 80 Outer product to multiply vectors, 280–281 Parameters of models, 66–67 Partial derivatives, 212–236 best-fit line for linear regression, 226–233 definition and notation, 212–218 gradients and Hessians, 218–221 Lagrange multipliers, 233–237 optimization, 222–226 Partitions: matrix, 275 Riemann approximations and, 166–169 sample space, 93–94 Parts, integration by, 178–180 PCA (principal component analysis), 294, 359–362 PDF (probability density functions). See Probability density functions (PDF) PEMDAS (arithmetic order of operations), 4 Perfect square, quadratic expression as, 29 Permutations, 88–89 π, 123 PMF (probability mass functions), 78, 182, 188 Point estimates and comparative statics, 123–124 Polynomials, 58–76 higher-order, 61–62 linear functions and graphs, 59–60 linear regression, 63–72 Positive-definite matrices, 222–223, 350–354 Positive real numbers, 3
Positive-semidefinite matrix, 352n Positive slope in linear function, 59–60 Posterior distribution, 97 Power functions, 14, 16 Power rule for derivatives, 131 Prime factorization, 5 Prime numbers, 5 Principal component analysis (PCA), 294, 359–362 Probability, 77–107 Bayes’ rule, 95–99 conditional, 92–94 counting theory, 87–89 events and sample spaces, 77–78 probability function properties, 78–86 sampling problems, 90–92 social science sources on, 380 See also Moments of probability distributions Probability density functions (PDF): bivariate standard normal, 207–208 definition of, 78 integration of, 182–187 kurtosis of, 196 skewness of, 196 Probability distributions. See Joint probability distributions and moments Probability mass functions (PMF), 78, 182, 188 Product rule for derivatives, 132, 137 Products, 16. See also Summations and products Quadratic equations, 31–32. See also Polynomials Quartic functions, 62 Quotient, logarithm of, 16 Quotient rule for derivatives, 132–133, 137–138 Radicals, roots as, 12–13 Random variables, 182 Range, domain and, 56–58 Rank of matrices, 320–323 Rational numbers, 3, 45
Real numbers, 3 Reciprocal functions, 133 Reciprocals of numbers, 6, 16–17 Reduced radical form of roots, 13 Region of integration, 238, 240 Relative extrema, 150 Relative maxima and minima, 148–149 Replacement, sampling with and without, 90–93 Representation of variables, 358–359 Residuals: linear regression, 66–68 sum of squared, 68–71, 227, 231, 311 Riemann, Bernhard, 164 Riemann sums, 163–168, 240 Right factor matrix, 281 Right inverse of matrix, 303 Right limits of functions, 112 Right Riemann sum, 164–166 Roots: convergence algorithms and, 155–156 function, 62 review of, 11–13 square, 11, 56, 133 Row number of matrix elements, 272 Row vectors, 219, 272 Saddle points in functions, 148–149, 152 Samples: events and, 77–78 probability, 90–92 replacement in, 90–93 Sarrus’ rule, 309–310 Scalar matrices: definition of, 273–274 multiplication of, 277–278 Scalars, 206 Scaling measurement, 359 Scatterplots, 65–66 Secant lines, 126 Second derivative test, 153 Second fundamental theorem of calculus, 173 Second partial derivative, 216–217
SER (standard error of the regression), 316 Set builder notation, 43–46 Sets and functions, 41–76 domain and range, 56–58 function compositions and inverses, 52–54 graphs, 55–56 intervals, 45–48 mappings as, 51–52 multivariate calculus, 206–208 polynomials, 58–76 set notation, 41–45 Venn diagrams, 48–51 See also Probability Simple regression, 71 Singular coefficient matrices, 335–342 Singularity of matrices, 318–319 Singular value decomposition, 366 Skewness of probability density functions (PDF), 196 Slope in linear function, 59 SOAP problems, 30 Specific solutions, 339 Spheres, 206n Square matrices, 273 Square roots, 11, 56, 133 SSR (sum of squared residuals), 68–71, 227, 231, 311 Stack Overflow website, 250n Standard deviation, 195–196 Standard error of the regression (SER), 316 Stata statistical program, 154 Statics, comparative, 123–124 Statistical measurement models: correspondence analysis, 362–371 principal component analysis, 360–362 quantifying indirect variables, 357–360 Statistical significance, 95 percent confidence interval for, 187, 316–317 Stewart, Potter, 357–358
Subsets, 42 Subtraction, 6–8, 277 Summations and products: long product properties, 22–25 summation properties, 19–22 uses of, 17–19 Sum-of-cubes formula, 30 Sum of squared residuals (SSR), 68–71, 227, 231, 311 Symbols, algebraic, 6 Symmetric Hessian matrices, 220 Symmetric matrices, 273–274 Tangent lines, 126 Tensor product of matrices, 278–279 3 × 3 matrices, 309–310 Time-series and time-series cross section (TSCS) methodology, 381 Trace of matrix, 276–277 Transformation matrices: altering vectors with, 293–294 angle between vectors, 291–292 geometric representation of vectors, 285–289 orthogonal vectors, 292 unit vectors and, 289–291 Transpose of matrix, 276 Trapezoidal Riemann sum, 164–165, 167–168 Triangular matrices, 274 Trivially true statements, 336, 338 Trivial solutions, 343–344 True-positive rates, 96 TSCS (time-series cross section) methodology, 381 2 × 2 matrix, inverse of, 303–305 Uniform distribution across outcomes, 80 Union of sets, 42 Unions of events, 82–83 Unit eigenvector, 356 Unit of analysis, 63 Universal sets, 42 Upper bounds of intervals, 46 U.S. Supreme Court, 357–358
u-substitution for solving integrals, 176–178 Utility function, 148 Variables: collinear, 319 controlling for, 71–72 definition of, 6 free, 338 in data sets, 63 index, 17 isolating, 25–27 omitted, 71 random, 182 See also Statistical measurement models Variance, 193–195 Vectors: angle between, 291–292 basis formed by, 322n
definition of, 219, 272–273 geometric representation of, 285–289 multiplication of, 280–281 orthogonal, 292 transformation matrices to alter, 293–294 unit, 289–291 Venn diagrams, 48–51, 82 Vertex of parabolas, 61 Vertical asymptotes, 118 Vertical line test of functions, 56 Volume, 238, 240 Whole numbers, 3 x-axis on graphs, 55 y-axis on graphs, 55 y-intercept in linear function, 59 Zero, exponent of, 10 z-score, 196